GenoClaw: Patient-Owned AI Health Agents on Decentralized Virtual Bioinformatic Machines
A technical architecture for sovereign, autonomous AI health agents where every patient's genomic and clinical data remains under their exclusive cryptographic control, with transparent attribution, revocable consent, and direct economic participation.
Abstract
We present GenoClaw, an autonomous AI health agent platform in which each agent instance is cryptographically bound to a single patient's BioWallet and executes within a kernel-isolated NVIDIA NemoClaw/OpenShell sandbox. GenoClaw integrates a complete, validated clinical-genomic data pipeline: whole-genome sequencing via NVIDIA Clara Parabricks 4.6.0-1 on A100 GPU hardware (producing 2.87 million variant records in under 100 seconds); Epic FHIR R4 Patient Access API ingestion yielding 1,463 structured clinical records; and OpenCRAVAT 2.13.0 multi-annotator analysis identifying 351 clinically relevant variants across 12 oncology knowledge bases. The agent inference layer uses GPT-OSS-120B via Cloudflare Workers AI, with NVIDIA Nemotron 120B and Meta Llama 3.3 70B as fallback models, preceded by a HIPAA Safe Harbor processor that strips all 18 PHI identifiers before any data reaches the language model. Consent is managed through a dual-license architecture: Story Protocol Programmable IP Licenses for permanent attribution and Sequentias BioPIL revocable licenses for GDPR-compliant consent. Agent-to-agent data commerce is enabled through the x402 BioRouter protocol, which facilitates HTTP 402 micropayment-gated access between autonomous agents. We argue that this architecture instantiates a new paradigm — the Decentralized Virtual Bioinformatic Machine — in which patient ownership, authentic data quality, and transparent economic attribution are architectural invariants rather than policy afterthoughts. The 23andMe bankruptcy of March 2025 is examined as the definitive empirical case for why custodial genomic data models are structurally incompatible with patient interests.
Executive Summary (Non-Technical)
Today, your health data lives on someone else's server. When that company goes bankrupt — as 23andMe did in March 2025, exposing 15 million customers' genomic data to involuntary sale — you have no legal recourse, no compensation, and no way to revoke access.
GenoClaw is the answer. It is a personal AI health assistant that runs inside your wallet, not inside a company's server. Your genomic data, your clinical records, your AI-generated health insights — all of these are owned by you as cryptographically secured digital assets (BioNFTs). When a researcher wants access to your data, they must pay you directly. You can revoke that access at any time. Every use of your data is transparently recorded on a blockchain.
This document describes how GenoClaw works technically, presents empirical results from our production testing, and explains why this architecture represents a fundamental paradigm shift in the relationship between patients and health data.
1. Introduction
1.1 The Structural Failure of Custodial Genomic Data
On March 23, 2025, 23andMe Inc. filed for Chapter 11 bankruptcy protection in the Eastern District of Missouri, with genomic profiles of approximately 15 million customers listed as the company's primary asset.[1] The filing exposed a structural contradiction that had existed since the company's founding: customers had paid to generate their genomic data, yet that data was legally owned by a corporation whose interests were not aligned with theirs. In July 2025, TTAM Research Institute — a nonprofit controlled by Anne Wojcicki — acquired the company for $305 million. The 15 million customers received no compensation, no consultation, and no right of refusal. Many did not know the sale had occurred.
This outcome was not a failure of regulation. The Health Insurance Portability and Accountability Act of 1996 (HIPAA), the General Data Protection Regulation (GDPR), and the California Consumer Privacy Act (CCPA) all contain provisions that, in theory, protect health data subjects.[2,3,4] The failure was architectural: these regulations assume a two-party relationship in which a company holds data and a patient is protected from the company's misuse of it. They do not and cannot protect patients when the data is legally the company's property to begin with.
Privacy laws were designed to protect people from companies that control their data. Those laws become structurally unnecessary when patients own and control their data directly. The goal is not better regulation of custodial models — it is the elimination of the custodial model entirely.
The Electronic Health Record (EHR) ecosystem compounds this problem. Despite the 21st Century Cures Act Final Rule mandating open APIs for patient data access as of April 2021, most clinical records remain effectively inaccessible to patients in machine-readable form.[5] Patients cannot export, analyze, or share their own clinical histories without institutional intermediaries who introduce friction, cost, and surveillance.
Artificial intelligence exacerbates the asymmetry. Large language models trained on patient data generate economic value — in the form of diagnostic capabilities, drug discovery insights, and research publications — without any economic return to the patients whose data enabled those capabilities. This is not an oversight. It is the intended design of current AI health platforms, which depend on acquiring data from patients without meaningful compensation and monetizing it through products sold to third parties.
1.2 Prior Art and Its Limitations
Several approaches have attempted to address the genomic data ownership problem. We examine the most prominent and explain why each fails to provide a complete solution.
Federated Learning
Federated learning — proposed by McMahan et al. at Google in 2017 as a privacy-preserving technique for training models on distributed data — has been widely adopted by health AI companies as a purported solution to patient privacy concerns.[6] In the federated model, patient data never leaves the institution's server; instead, model gradients are shared and aggregated centrally.
Federated learning is biodata laundering. It degrades data quality through gradient approximation, erases patient attribution entirely, avoids revenue sharing with data owners, and was invented by data aggregators seeking to profit from patient data without the legal exposure of explicit ownership. It treats patient data as a resource to be mined, not as an asset to be owned. GenoBank.io will not implement federated learning in any form.
The specific deficiencies of federated learning for genomic data are as follows: (1) gradient approximation introduces noise that is clinically unacceptable for diagnostic applications; (2) the federated aggregator — typically a commercial entity — retains the economic value of the trained model without compensating data contributors; (3) patients cannot revoke their contribution after gradients have been incorporated into a global model; and (4) the model cannot provide attribution at the individual patient level, which violates the principle of data dignity.[7]
Zero-Knowledge Genomics
Zero-knowledge proofs (ZKPs) — cryptographic protocols that allow a prover to demonstrate knowledge of a value without revealing the value itself — have been proposed as a privacy-preserving substrate for genomic queries.[8] This approach is technically incorrect for genomic data and should not be deployed in clinical contexts.
ZKPs require deterministic computation: for a given input, the output must be identical across all executions of the circuit. Genomic data is probabilistic and non-deterministic at every stage of the pipeline: base calling from raw signal involves probabilistic quality scores; variant calling involves Bayesian posterior probabilities over haplotype configurations; and annotation databases are continuously updated, changing the clinical significance of the same variant over time.[9] There is no deterministic circuit that can represent the complete semantics of a VCF file, and therefore no valid ZKP construction for genomic queries. GenoBank.io uses privacy-preserving Bloom filters instead — a technically correct and computationally efficient approach to private variant membership testing.
Traditional EHR Portals
Patient portals (e.g., MyChart, FollowMyHealth) provide read-only views of clinical data through authenticated web interfaces. They do not provide programmatic export, do not integrate with genomic data, do not support AI-driven analysis, and do not support any form of patient-controlled data sharing or compensation. They are display interfaces, not ownership instruments.
2. Background
2.1 Blockchain-Based Genomic Ownership
GenoBank.io introduced the BioNFT concept in 2020 as a mechanism for representing biosample ownership and consent on a public blockchain.[10] A BioNFT is an ERC-721 non-fungible token minted on Avalanche C-Chain that encodes: (1) the cryptographic identifier of a biosample or genomic file; (2) the wallet address of the patient-owner; (3) the laboratory that performed the analysis; and (4) a URI pointing to consent terms encoded in a Programmable IP License (PIL).
2.1.1 BioNFT as a Legal Instrument
A BioNFT is not merely a digital collectible — it is a programmable legal instrument that encodes property rights, consent terms, and access control into a single cryptographic token. Each BioNFT establishes four critical properties that no traditional consent form can provide:
- Provable Ownership: The ERC-721 token standard (Ethereum Improvement Proposal 721, Entriken et al. 2018)[11] guarantees that exactly one wallet address owns the token at any given time. Transfer of ownership is atomic, transparent, and recorded on an immutable public ledger. Unlike a consent form stored in a hospital's filing cabinet, a BioNFT's ownership chain is publicly verifiable by anyone, at any time, without the cooperation of any intermediary.
- Chain of Custody: Every transfer, license grant, and access event is recorded on-chain. The complete provenance of a biosample — from collection at a CLIA-certified laboratory through sequencing, analysis, and data sharing — is cryptographically verifiable. This creates an immutable chain of custody that satisfies both clinical regulatory requirements (CAP/CLIA accreditation standards) and intellectual property attribution requirements (Story Protocol IP Asset framework).
- Programmable Consent: The BioNFT's metadata URI points to machine-readable consent terms encoded in a Programmable IP License (PIL). Unlike paper consent forms that are interpreted by humans and enforced by legal departments, PIL terms are interpreted by smart contracts and enforced by cryptographic access control. A researcher who does not hold a valid license token simply cannot decrypt or access the underlying data — no human decision is required.
- Bankruptcy Protection: Because the patient holds the BioNFT in their own wallet (not in a company's custodial account), the data it represents cannot be included in any company's bankruptcy estate. This is the critical failure that led to the 23andMe data sale: patient data was stored on company servers, owned by the company, and thus became a corporate asset in bankruptcy proceedings. A BioNFT-gated data vault is legally analogous to a patient's personal safe deposit box — the company that manufactured the lock has no claim to the contents.
2.1.2 The Dual-Chain Architecture
BioNFTs operate across two complementary blockchain networks, each serving a distinct regulatory purpose:
- Avalanche C-Chain (public, permanent): BioNFT ownership tokens, Story Protocol IP Asset registration, and permanent attribution records. These records establish who owns what and who contributed to which research output. Permanence is desirable here because intellectual property attribution should survive the original consent relationship — a patient who contributed data to a research paper should receive credit even after revoking future access.
- Sequentias Network (chain ID 15132025, revocable): BioPIL consent tokens, access control grants, and consent lifecycle events. These records implement the revocable consent requirements of GDPR Article 7(3), CCPA's right to delete, and HIPAA's individual right of access. Revocability is essential here because data access permissions must be withdrawable at any time without destroying the attribution record.
This dual-chain design resolves a fundamental tension in genomic data governance: the need for permanent attribution (so patients always receive credit) coexists with the need for revocable consent (so patients can withdraw access). By separating these concerns onto different chains with different immutability guarantees, GenoBank.io achieves both simultaneously — unlike single-chain approaches that must sacrifice either attribution permanence or consent revocability.
This approach builds on the broader NFT infrastructure established by EIP-721 (Entriken et al., 2018) and the IP Asset framework introduced by Story Protocol (2024).[11,12] Story Protocol provides a programmable licensing layer on top of standard NFT infrastructure, enabling on-chain license terms that specify permitted uses, revenue share percentages, and attribution requirements. GenoBank.io extends this with BioPIL — Bioinformatic Programmable IP Licenses — which add genomic-specific terms including GDPR-compliant revocability on the Sequentias Network (chain ID 15132025).
2.2 The Decentralized Virtual Bioinformatic Machine
The theoretical foundation of GenoClaw is the Decentralized Virtual Bioinformatic Machine (DVBM). A DVBM is a patient-controlled compute environment in which:
- All data assets are stored in patient-owned cryptographic vaults (GCS buckets with BioNFT-gated access)
- All compute is triggered by explicit patient consent and executed in isolated sandbox environments
- All outputs — variant calls, clinical annotations, AI-generated insights — are registered as derivative IP Assets on Story Protocol with transparent attribution
- All access by third parties is governed by BioPIL terms and logged to an immutable blockchain ledger
- Revenue sharing is automatic and non-custodial via micropayment protocols
This architecture inverts the traditional model. In the traditional model, a company owns a centralized compute environment and patients contribute data as raw inputs. In the DVBM model, the patient owns the compute environment and third parties (researchers, clinicians, AI systems) request access as credentialed guests.
2.3 Privacy-Preserving Bloom Filters for Genomic Queries
A Bloom filter is a space-efficient probabilistic data structure that answers membership queries with a controllable false-positive rate and zero false-negative rate.[13] For genomic applications, a patient's variant set is encoded as a Bloom filter: each variant (chromosome, position, reference allele, alternate allele) is hashed through multiple hash functions and the corresponding bits are set in the filter array.
A query for a specific variant can be answered by checking whether all bits corresponding to that variant's hashes are set — without requiring access to the underlying VCF data. This enables a researcher to determine, with high probability, whether a patient carries a specific variant (e.g., BRCA1 c.5266dupC) without the patient ever sharing their raw genomic data with the researcher. The Bloom filter can be shared publicly without exposing the complete variant set, because the filter cannot be efficiently inverted to reconstruct the underlying data.[14]
This approach is technically correct for genomic data (unlike ZKPs), computationally efficient (sub-millisecond query latency), and preserves data dignity because the patient retains the authentic, complete dataset.
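The encoding and query steps described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GenoBank.io implementation: the class name, the `chrom:pos:ref:alt` key format, the SHA-256 double-hashing scheme, and the example coordinates are all hypothetical.

```python
import hashlib
import math


class VariantBloomFilter:
    """Bloom filter over variant keys (assumed format: chrom:pos:ref:alt)."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.001):
        # Standard sizing: m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        n, p = expected_items, false_positive_rate
        self.num_bits = max(8, math.ceil(-n * math.log(p) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    @staticmethod
    def key(chrom: str, pos: int, ref: str, alt: str) -> str:
        return f"{chrom}:{pos}:{ref}:{alt}"

    def _positions(self, key: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


# Illustrative (made-up) coordinates; not a real annotation of any variant.
patient_filter = VariantBloomFilter(expected_items=1000)
carried = VariantBloomFilter.key("17", 43000000, "G", "GC")
patient_filter.add(carried)
print(patient_filter.might_contain(carried))
```

A researcher holding only `patient_filter.bits` can run `might_contain` without ever seeing the VCF. A `True` answer is probabilistic (bounded by the configured false-positive rate), while a `False` answer is exact.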
3. System Architecture
3.1 Layer Overview
GenoClaw is organized as a five-layer stack. Each layer has a clearly defined responsibility boundary, and communication between layers is authenticated via Web3 cryptographic signatures.
┌──────────────────────────────────────────────────────────────────────┐
│ LAYER 5: APPLICATION LAYER │
│ GenoClaw Agent (9 Bioinformatics Skills) │
│ ┌──────────────┐ ┌───────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ cancer-risk │ │ pharmgx │ │variant-annot │ │rare-disease │ │
│ │ │ │ │ │ │ │ -dx │ │
│ └──────────────┘ └───────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌───────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ancestry-pca │ │consent- │ │variant-call │ │alphagenome- │ │
│ │ │ │manager │ │ │ │ interpret │ │
│ └──────────────┘ └───────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ │
│ │bio-orchestra-│ LLM: GPT-OSS-120B / Nemotron 120B / Llama 3.3 70B│
│ │ tor │ via Cloudflare Workers AI ($0.011 / 1K tokens) │
│ └──────────────┘ │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 4: SANDBOX LAYER │
│ NVIDIA NemoClaw / OpenShell │
│ K3s cluster inside Docker (port 9095) │
│ Landlock LSM + seccomp BPF (kernel-level process isolation) │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 3: PRIVACY LAYER │
│ HIPAA Safe Harbor Processor (strips 18 PHI identifiers) │
│ OpenShell Privacy Router (audits all outbound LLM requests) │
│ Bloom Filter Engine (private variant membership queries) │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 2: CONSENT LAYER │
│ Story Protocol PIL (permanent: licenses 1-4) │
│ Sequentias BioPIL (revocable GDPR-compliant: licenses 5-9) │
│ BioDataRouter.sol (on-chain ownership registry) │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 1: PAYMENT & DISCOVERY LAYER │
│ x402 BioRouter Protocol (HTTP 402 micropayments) │
│ Sequentias Network (chain ID 15132025, BioCID addressing) │
│ Agent-to-Agent Commerce (researcher AI ↔ patient AI negotiation) │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────┐
│ DATA STORAGE (cross-cutting) │
│ Google Cloud Storage (GCS) — genomic files │
│ MongoDB — FHIR cache + job tracking │
│ Avalanche C-Chain — BioNFT ownership │
│ Sequentias Network — consent records │
└──────────────────────────────────────────────────────┘
3.2 Application Layer: GenoClaw Agent Skills
The GenoClaw agent exposes nine discrete bioinformatics skills, each corresponding to a well-defined clinical or analytical function. Skills are invoked by the agent's bio-orchestrator based on patient query intent, and each skill invocation is logged to the Sequentias consent ledger.
| Skill | Function | Primary Data Source | Output |
|---|---|---|---|
| cancer-risk | Polygenic risk score + pathogenic variant identification | VCF + ClinVar + COSMIC + OncoKB | Risk tier + actionable variants |
| pharmgx | Pharmacogenomics: drug-gene interaction assessment | VCF + PharmGKB + CPIC guidelines | Drug sensitivity / contraindication list |
| variant-annotate | Functional annotation of variant set | VCF + OpenCRAVAT 12-annotator panel | Annotated variant table |
| rare-disease-dx | Rare disease differential diagnosis | VCF + OMIM + HPO + ClinVar | Candidate gene + disorder list |
| ancestry-pca | Ancestry inference via principal component analysis | SNP array or WGS VCF | Population assignment + PCA plot |
| consent-manager | View, grant, and revoke data access permissions | Sequentias BioPIL ledger | Permission state + audit trail |
| variant-call | Trigger GPU-accelerated variant calling pipeline | BAM/BAI on GCS | VCF (Clara Parabricks DeepVariant) |
| alphagenome-interpret | Regulatory variant interpretation via AlphaGenome | VCF + regulatory element databases | Predicted regulatory impact scores |
| bio-orchestrator | Intent classification + multi-skill workflow coordination | Patient query (natural language) | Skill invocation plan |
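To make the orchestrator's role concrete, the sketch below maps a patient query to a skill invocation plan. It deliberately swaps in a naive keyword match for whatever intent classifier the production bio-orchestrator uses; the keyword lists, the `plan_skills` function, and the `variant-annotate` fallback are illustrative assumptions, and only the skill names come from the table above.

```python
# Hypothetical keyword table; the real orchestrator's classifier is not
# described in this document.
SKILL_KEYWORDS = {
    "cancer-risk": ("cancer", "brca", "tumor"),
    "pharmgx": ("drug", "medication", "dose"),
    "rare-disease-dx": ("rare disease", "diagnosis"),
    "ancestry-pca": ("ancestry", "heritage"),
    "consent-manager": ("consent", "revoke", "permission"),
}


def plan_skills(query: str) -> list:
    """Return the skills whose keywords appear in the query, in table order."""
    text = query.lower()
    plan = [
        skill
        for skill, words in SKILL_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    # Assumed default when nothing matches: fall back to plain annotation.
    return plan or ["variant-annotate"]


print(plan_skills("Does my BRCA1 result change which medication I can take?"))
# → ['cancer-risk', 'pharmgx']
```

A multi-skill plan like this one is what the table calls a "skill invocation plan": the orchestrator would run each listed skill and merge the results before responding.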
3.3 Sandbox Layer: NVIDIA NemoClaw / OpenShell
Each GenoClaw instance executes within an NVIDIA NemoClaw/OpenShell sandbox: a K3s (lightweight Kubernetes) cluster running inside Docker on the GenoBank.io production infrastructure (port 9095). The sandbox provides kernel-level process isolation via two Linux kernel mechanisms (Landlock is a Linux Security Module; seccomp is a separate system-call filtering facility):
- Landlock LSM: A Linux Security Module (available since kernel 5.13) that restricts filesystem access using a capability-based, unprivileged sandboxing model. Each GenoClaw instance is granted access only to its own patient vault directory and necessary system libraries, with no capability to access other patients' data even if the application layer is compromised.[15]
- seccomp BPF: Secure Computing Mode with Berkeley Packet Filter programs restricts the set of Linux system calls available to the sandboxed process. The GenoClaw profile allows 67 of the 400+ available syscalls, blocking all network-creation, raw-socket, and kernel-module syscalls that could be used for lateral movement or data exfiltration.[16]
This dual-layer isolation means that even a fully compromised GenoClaw agent process cannot access another patient's data, cannot create new network connections outside the pre-approved outbound channels, and cannot persist state beyond its designated vault directory.
3.4 Privacy Layer: HIPAA Safe Harbor and OpenShell Privacy Router
The Privacy Layer intercepts all data before it reaches any language model and applies two sequential transformations:
HIPAA Safe Harbor Processor. The Safe Harbor method defined in 45 CFR § 164.514(b)(2) requires the removal of 18 specified categories of identifiers before data can be considered de-identified.[17] The GenoClaw HIPAA processor applies regular-expression and named-entity recognition rules to strip: names; geographic subdivisions smaller than a state; all elements of dates except year, including ages above 89; telephone numbers; fax numbers; email addresses; Social Security numbers; medical record numbers; health plan beneficiary numbers; account numbers; certificate and license numbers; vehicle identifiers and serial numbers; device identifiers and serial numbers; URLs; IP addresses; biometric identifiers (fingerprints, voice prints); full-face photographs; and any other unique identifying number, characteristic, or code. This processing occurs in-process, within the NemoClaw sandbox, before any data is transmitted to Cloudflare Workers AI endpoints.
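The regular-expression half of such a processor can be sketched as follows. This is a partial illustration only: it covers a handful of the 18 categories, omits the named-entity recognition step entirely, and uses hypothetical pattern definitions and placeholder labels rather than the rules GenoClaw actually ships.

```python
import re

# Hypothetical patterns for a few identifier categories. A real Safe Harbor
# processor needs all 18 categories plus NER for names and locations.
PATTERNS = [
    ("URL", re.compile(r"\bhttps?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]

# ISO-8601 dates: Safe Harbor permits keeping the year, so only the
# month and day are dropped.
ISO_DATE = re.compile(r"\b(\d{4})-\d{2}-\d{2}\b")


def redact(text: str) -> str:
    """Replace matched identifiers with category placeholders."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return ISO_DATE.sub(r"\1", text)


print(redact("Seen 2024-03-15, call 555-123-4567 or mail jane@example.org"))
# → Seen 2024, call [PHONE] or mail [EMAIL]
```

Pattern order matters in a design like this: URLs and email addresses are replaced first so that digits inside them are not partially matched by the numeric patterns.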
OpenShell Privacy Router. All outbound LLM requests are additionally audited by the OpenShell Privacy Router, which: (1) logs the request metadata (timestamp, patient wallet hash, skill invoked, token count) to the Sequentias consent ledger; (2) verifies that the patient has an active consent record for LLM processing of their data; and (3) enforces rate limits and spending caps defined in the patient's BioPIL terms.
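The three router checks (logging, consent verification, and limit enforcement) could be composed roughly as below. The class and field names, the one-minute rate window, and the SHA-256 wallet hash are assumptions made for illustration; the in-memory `audit_log` list stands in for the Sequentias consent ledger.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class BioPilTerms:
    """Hypothetical subset of a patient's BioPIL terms."""
    consent_active: bool
    max_requests_per_minute: int
    spending_cap_usd: float


@dataclass
class PrivacyRouter:
    terms: BioPilTerms
    spent_usd: float = 0.0
    recent: list = field(default_factory=list)     # timestamps of allowed calls
    audit_log: list = field(default_factory=list)  # stand-in for the ledger

    def authorize(self, wallet, skill, token_count, cost_usd, now=None):
        now = time.time() if now is None else now
        # Keep only calls inside the sliding one-minute window.
        self.recent = [t for t in self.recent if now - t < 60]
        if not self.terms.consent_active:
            decision = "denied:no_consent"
        elif len(self.recent) >= self.terms.max_requests_per_minute:
            decision = "denied:rate_limit"
        elif self.spent_usd + cost_usd > self.terms.spending_cap_usd:
            decision = "denied:spending_cap"
        else:
            decision = "allowed"
            self.recent.append(now)
            self.spent_usd += cost_usd
        # Every request is logged, whether allowed or denied; only a hash
        # of the wallet address is recorded.
        self.audit_log.append({
            "timestamp": now,
            "wallet_hash": hashlib.sha256(wallet.lower().encode()).hexdigest(),
            "skill": skill,
            "token_count": token_count,
            "decision": decision,
        })
        return decision == "allowed"
```

Logging denied requests as well as allowed ones is a design choice in this sketch: it gives the patient an audit trail of attempted access, not just successful access.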
3.5 Consent Layer: Dual-License Architecture
GenoClaw implements a dual-license architecture to satisfy both permanent IP attribution and revocable GDPR compliance simultaneously.
Patient Data Asset
│
├─── Story Protocol PIL (Permanent Attribution)
│    ├── PIL #1: Non-Commercial Research
│    ├── PIL #2: Commercial Research
│    ├── PIL #3: Exclusive License
│    └── PIL #4: Public Good
│        └── [Immutable, on Story Protocol]
│
└─── Sequentias BioPIL (Revocable GDPR-Compliant)
     ├── BioPIL #5: GDPR Research
     ├── BioPIL #6: AI Training
     ├── BioPIL #7: Clinical Use
     ├── BioPIL #8: Pharma Research
     └── BioPIL #9: Family Inheritance
         └── [Revocable, on Sequentias Network, chain 15132025]
Access Flow:
┌─────────────────────────────────────────────────────────────┐
│ Researcher Agent requests patient data │
│ │ │
│ ▼ │
│ BioDataRouter.sol checks ownership registry │
│ │ │
│ ▼ │
│ Sequentias BioPIL: is consent active? ──[No]──► Blocked │
│ │ [Yes] │
│ ▼ │
│ x402 BioRouter: micropayment collected │
│ │ │
│ ▼ │
│ GCS pre-signed URL issued (time-limited) │
│ │ │
│ ▼ │
│ Attribution recorded on Story Protocol │
│ │ │
│ ▼ │
│ Revenue share distributed to patient wallet │
└─────────────────────────────────────────────────────────────┘
The critical design principle is that GDPR Article 17 (the right to erasure) and Article 7(3) (the right to withdraw consent) are implemented not as legal policies but as cryptographic invariants. A patient who revokes consent on the Sequentias Network will have their BioPIL token burned, immediately rendering all GCS pre-signed URL generation impossible. The data itself remains in GCS (owned by the patient) and can be re-consented or permanently deleted by the patient at any time.
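The interaction between revocable consent and permanent attribution can be illustrated with a small in-memory simulation. Everything here is a stand-in: `ConsentLedger` models the Sequentias BioPIL ledger, the plain list models Story Protocol attribution records, the returned dictionary models a time-limited GCS URL, and the 900-second lifetime is an arbitrary assumption. The ownership-registry lookup from the access flow is omitted for brevity.

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when a request fails the consent or payment gate."""


@dataclass
class ConsentLedger:
    """In-memory stand-in for the revocable BioPIL ledger."""
    active: set = field(default_factory=set)  # {(bio_cid, license_id)}

    def grant(self, bio_cid: str, license_id: int) -> None:
        self.active.add((bio_cid, license_id))

    def revoke(self, bio_cid: str, license_id: int) -> None:
        # Models burning the BioPIL token on revocation.
        self.active.discard((bio_cid, license_id))


def request_access(ledger, attribution_log, bio_cid, license_id, researcher, paid):
    """Follow the access flow: consent check, payment check, then issue access."""
    if (bio_cid, license_id) not in ledger.active:
        raise AccessDenied("no active consent")
    if not paid:
        raise AccessDenied("payment required")
    # Attribution is append-only and is never touched by revocation.
    attribution_log.append(
        {"bio_cid": bio_cid, "license": license_id, "researcher": researcher}
    )
    return {"access": "time-limited-url-placeholder", "ttl_seconds": 900}


ledger, attribution = ConsentLedger(), []
ledger.grant("biocid-example", 6)
request_access(ledger, attribution, "biocid-example", 6, "lab-agent", paid=True)
ledger.revoke("biocid-example", 6)
# Any further request_access call now raises AccessDenied, while the
# attribution record from the earlier access remains in place.
```

The point of the sketch is the asymmetry: `revoke` changes what can happen next, but it has no code path that can remove an existing attribution record.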
3.6 Payment Layer: x402 BioRouter Protocol
The x402 BioRouter protocol implements agent-to-agent data commerce using HTTP 402 ("Payment Required") as the signaling mechanism for paywall-gated genomic data access. This extends the original x402 micropayment protocol proposed for general HTTP resources to the specific semantics of genomic data licensing.[18]
Researcher's AI Agent Patient's GenoClaw Agent
│ │
│ GET /biorouter/{BioCID} │
│ ─────────────────────────────────────────────► │
│ │
│ 402 Payment Required │
│ x402-accept: EVM/15132025 │
│ x402-price: 0.0042 SEQ │
│ x402-license: BioPIL#6 (AI Training) │
│ ◄───────────────────────────────────────────── │
│ │
│ [Researcher agent evaluates price + terms] │
│ │
│ POST /biorouter/{BioCID}/pay │
│ Authorization: EIP-712 signed payment intent │
│ ─────────────────────────────────────────────► │
│ │
│ [BioDataRouter.sol verifies payment on-chain] │
│ │
│ 200 OK │
│ x402-access-token: {time-limited JWT} │
│ x402-attribution: Story Protocol IP Asset ID │
│ ◄───────────────────────────────────────────── │
│ │
│ [Researcher accesses GCS stream, time-limited] │
│ │
│ Revenue share automatically distributed │
│ to patient wallet via Sequentias Network │
The BioCID (Bioinformatic Content Identifier) addressing scheme used by Sequentias Network provides content-addressed identifiers for genomic data objects that are independent of storage location. A BioCID encodes: data type (VCF, BAM, FASTQ, FHIR bundle), content hash (SHA-256 of file), and the Sequentias chain ID for consent resolution. This enables a researcher's agent to discover and request a specific genomic dataset without knowing which GCS bucket it resides in — the BioDataRouter resolves the physical storage location after verifying consent and collecting payment.
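As a rough illustration of content addressing and the 402 challenge, the sketch below builds a BioCID from the three components named above and assembles the challenge headers shown in the sequence diagram. The `biocid:chain:type:hash` string layout and the function names are assumptions; the document does not specify the actual BioCID serialization.

```python
import hashlib

SEQUENTIAS_CHAIN_ID = 15132025  # chain ID stated in this document


def make_biocid(data_type: str, payload: bytes, chain_id: int = SEQUENTIAS_CHAIN_ID) -> str:
    """Build an identifier from data type, SHA-256 content hash, and chain ID.

    The colon-separated layout is hypothetical.
    """
    digest = hashlib.sha256(payload).hexdigest()
    return f"biocid:{chain_id}:{data_type.lower()}:{digest}"


def parse_biocid(biocid: str) -> dict:
    scheme, chain_id, data_type, digest = biocid.split(":")
    if scheme != "biocid" or len(digest) != 64:
        raise ValueError("not a BioCID in the assumed format")
    return {"chain_id": int(chain_id), "data_type": data_type, "sha256": digest}


def payment_required(price_seq: str, license_label: str):
    """Return the status code and headers of the 402 challenge."""
    headers = {
        "x402-accept": f"EVM/{SEQUENTIAS_CHAIN_ID}",
        "x402-price": f"{price_seq} SEQ",
        "x402-license": license_label,
    }
    return 402, headers


cid = make_biocid("VCF", b"##fileformat=VCFv4.2\n")
status, headers = payment_required("0.0042", "BioPIL#6 (AI Training)")
print(parse_biocid(cid)["data_type"], status, headers["x402-price"])
```

Because the identifier is derived from the file's hash rather than its location, the same BioCID stays valid if the file moves between buckets, which is what lets the BioDataRouter resolve storage only after consent and payment are verified.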
4. Data Pipeline
The GenoClaw data pipeline integrates three independently sourced data streams — genomic sequencing, clinical records, and computational annotations — into a unified patient knowledge graph that the agent can reason over.
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ GENOMIC DATA     │   │ CLINICAL DATA    │   │ ANNOTATION       │
│                  │   │                  │   │                  │
│ Source: Invitae  │   │ Source: Epic     │   │ Source:          │
│ Format: BAM/BAI  │   │ MyChart FHIR R4  │   │ OpenCRAVAT       │
│                  │   │                  │   │ 2.13.0           │
│ Upload to GCS    │   │ Patient Access   │   │                  │
│ (gcsfuse mount)  │   │ API OAuth 2.0    │   │ 12 Oncology      │
│        │         │   │        │         │   │ Annotators       │
│        ▼         │   │        ▼         │   │        │         │
│ Clara Parabricks │   │ FHIR R4 Bundle   │   │        ▼         │
│ DeepVariant      │   │ → MongoDB cache  │   │ 351 Annotated    │
│ A100 GPU         │   │ 1,463 records    │   │ Variants         │
│        │         │   │                  │   │                  │
│        ▼         │   │                  │   │                  │
│ VCF Output       │   │                  │   │                  │
│ 2.87M variants   │   │                  │   │                  │
│ in 99 seconds    │   │                  │   │                  │
└────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                                ▼
                   ┌──────────────────────────┐
                   │ Patient Knowledge Graph  │
                   │ (MongoDB + GCS)          │
                   │                          │
                   │ • Genomic variants       │
                   │ • Clinical conditions    │
                   │ • Medications            │
                   │ • Lab results            │
                   │ • Vital signs            │
                   │ • Annotations            │
                   └────────────┬─────────────┘
                                │
                                ▼
                   ┌──────────────────────────┐
                   │ GenoClaw LLM Inference   │
                   │ (HIPAA-stripped input)   │
                   └──────────────────────────┘
4.1 Genomic Data: Clara Parabricks DeepVariant
Whole-genome sequencing data is ingested from patient-provided BAM/BAI files. In the validated production pipeline, source data originates from Invitae Corporation's clinical sequencing service. Files are uploaded to a patient-specific GCS bucket (genobank-backups-gcp) using gcsfuse direct mounting to avoid unnecessary data transfer (gcsfuse --implicit-dirs --type-cache-max-size-mb=32 BUCKET /mountpoint). This is architecturally critical: downloading a typical 60–120 GB BAM file to local disk before processing is not scalable for a population-scale platform.
Variant calling is performed by NVIDIA Clara Parabricks 4.6.0-1 running the deepvariant pipeline on an NVIDIA A100 GPU instance on Google Compute Engine. DeepVariant uses a convolutional neural network trained on labeled genomic training sets to classify candidate variant positions as reference, heterozygous, or homozygous alternate.[19] GPU acceleration via Parabricks reduces wall-clock time from the 24–48 hours required by CPU-based DeepVariant to under two minutes for a whole-genome sample at 30x coverage.
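For orientation, a variant-calling invocation of this kind might be assembled as shown below. The sketch only builds the argument list for the Parabricks `pbrun deepvariant` command using its `--ref`, `--in-bam`, `--out-variants`, and `--num-gpus` options. The file paths under `/mnt/gcs` are hypothetical, and the exact options used in the production pipeline are not given in this document.

```python
def deepvariant_command(ref_fasta: str, in_bam: str, out_vcf: str, num_gpus: int = 1) -> list:
    """Build the argv for a Parabricks DeepVariant run (nothing is executed here)."""
    return [
        "pbrun", "deepvariant",
        "--ref", ref_fasta,         # reference genome FASTA
        "--in-bam", in_bam,         # aligned reads, read via the gcsfuse mount
        "--out-variants", out_vcf,  # output VCF path
        "--num-gpus", str(num_gpus),
    ]


# Hypothetical paths, assuming the bucket is mounted at /mnt/gcs.
cmd = deepvariant_command(
    ref_fasta="/mnt/gcs/reference/GRCh38.fa",
    in_bam="/mnt/gcs/patient/sample.bam",
    out_vcf="/mnt/gcs/output/sample.deepvariant.vcf",
)
print(" ".join(cmd))
```

On a host with Parabricks installed, the list could be passed to `subprocess.run(cmd, check=True)`. Building it as a list rather than a shell string avoids quoting problems with paths.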
Output VCF files are uploaded to the genobank-parabricks-output GCS bucket and registered as IP Assets on Story Protocol via the two-step minting workflow: mint_and_register_ip() followed by attach_license_terms(), with separate NFT metadata URI and IP metadata URI to ensure correct display on the Story Protocol explorer.
4.2 Clinical Data: Epic FHIR R4 Patient Access API
Clinical records are imported via the Epic MyChart FHIR R4 Patient Access API, which implements the SMART on FHIR authorization framework.[20] The patient authenticates with MyChart credentials and grants a scoped OAuth 2.0 token that permits read-only access to their clinical records. GenoBank.io's FHIR ingestion service requests the following resource types:
- Patient (demographic anchor)
- Condition (ICD-10 coded diagnoses)
- MedicationRequest (active and historical prescriptions)
- Observation (laboratory results, vital signs)
- DiagnosticReport (imaging, pathology)
- AllergyIntolerance
- Immunization
- Procedure
Records are cached in a MongoDB collection with a schema that preserves FHIR R4 resource structure, enabling FHIR-compliant query at the API layer while providing efficient indexed access to the underlying data. The cache is invalidated and refreshed when the patient re-consents or explicitly triggers a sync.
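The first processing step after retrieval, splitting a FHIR Bundle into per-type collections while preserving each resource's structure, can be sketched as follows. The function name and the tiny example bundle are illustrative; the sketch relies only on the standard FHIR R4 Bundle shape (`entry[].resource.resourceType`) and the eight resource types listed above. The MongoDB write itself is not shown.

```python
from collections import defaultdict

# The eight resource types the ingestion service requests.
REQUESTED_TYPES = {
    "Patient", "Condition", "MedicationRequest", "Observation",
    "DiagnosticReport", "AllergyIntolerance", "Immunization", "Procedure",
}


def group_bundle(bundle: dict) -> dict:
    """Group Bundle entries by resourceType, keeping each resource intact."""
    grouped = defaultdict(list)
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        resource_type = resource.get("resourceType")
        if resource_type in REQUESTED_TYPES:
            grouped[resource_type].append(resource)
    return dict(grouped)


# Minimal made-up bundle for illustration.
example_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Observation", "id": "o1"}},
        {"resource": {"resourceType": "Observation", "id": "o2"}},
        {"resource": {"resourceType": "Encounter", "id": "e1"}},  # not requested
    ],
}
counts = {k: len(v) for k, v in group_bundle(example_bundle).items()}
print(counts)
# → {'Patient': 1, 'Observation': 2}
```

Storing each resource unmodified under its type is one simple way to keep the cache FHIR-shaped while still allowing per-type indexes.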
4.3 Variant Annotation: OpenCRAVAT Multi-Annotator Panel
Raw variant calls from DeepVariant are processed through OpenCRAVAT 2.13.0, the open-source platform for genomic variant interpretation developed by the Karchin Lab at Johns Hopkins University.[21] The production GenoClaw annotation pipeline uses a 12-annotator oncology panel:
| Annotator | Data Source | Primary Annotation Class |
|---|---|---|
| ClinVar | NCBI ClinVar (monthly build) | Clinical significance classification |
| COSMIC | Catalogue of Somatic Mutations in Cancer v97 | Somatic mutation frequency in cancer |
| OncoKB | Memorial Sloan Kettering OncoKB | Oncogenic effect + therapeutic implications |
| CHASMplus | Johns Hopkins CHASM+ model | Driver mutation probability |
| CIViC | Clinical Interpretation of Variants in Cancer | Clinical evidence + therapeutic biomarkers |
| PharmGKB | Pharmacogenomics Knowledgebase | Drug-gene interaction evidence |
| CADD | Combined Annotation-Dependent Depletion v1.7 | Deleteriousness score (Phred-scaled) |
| gnomAD | Genome Aggregation Database v4.1 | Population allele frequency |
| REVEL | Rare Exome Variant Ensemble Learner | Missense pathogenicity score |
| SpliceAI | Illumina SpliceAI v1.3 | Splicing effect prediction |
| OMIM | Online Mendelian Inheritance in Man | Gene-phenotype associations |
| InterVar | ACMG/AMP 2015 guidelines implementation | Automated ACMG variant classification |
4.4 Agent Inference: LLM Selection and HIPAA Pre-Processing
GenoClaw uses a three-tier LLM failover chain, all served via Cloudflare Workers AI:
- Primary: GPT-OSS-120B at $0.011 per 1,000 tokens
- Fallback 1: NVIDIA Nemotron 120B (open-weight)
- Fallback 2: Meta Llama 3.3 70B (open-weight)
All three models receive identical pre-processed input that has been de-identified by the HIPAA Safe Harbor Processor. The agent's system prompt encodes the relevant clinical context (de-identified condition list, medication list, and annotated variant summary) within the context window. The patient's natural language query is appended as the user message. The agent's response is returned to the patient through the NemoClaw sandbox; it is not stored by Cloudflare or any third-party LLM provider beyond the duration of the API call.
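The failover behaviour can be sketched generically. In this illustration each backend is an ordinary Python callable, so the stand-in functions below replace real Cloudflare Workers AI calls; the function names, the exception type, and the simulated outage are all assumptions, and only the three model names come from this document.

```python
class AllBackendsFailed(Exception):
    """Raised when every model in the chain has failed."""


def complete_with_failover(prompt: str, backends):
    """Try each (name, callable) in order; return the first successful reply."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # any backend error triggers the next tier
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Stand-in backends for demonstration only.
def primary_down(prompt):
    raise TimeoutError("simulated outage")


def echo_model(prompt):
    return f"reply to: {prompt}"


chain = [
    ("GPT-OSS-120B", primary_down),
    ("Nemotron 120B", echo_model),
    ("Llama 3.3 70B", echo_model),
]
print(complete_with_failover("de-identified query", chain))
# → ('Nemotron 120B', 'reply to: de-identified query')
```

Because every tier receives the same `prompt`, the de-identification step only has to run once, before the chain is entered.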
5. Implementation Details
5.1 BioWallet: The Patient Identity Substrate
Every GenoClaw instance is anchored to a patient's BioWallet — a Web3 wallet implementation that serves as the cryptographic identity substrate for all genomic data ownership operations. GenoBank.io supports five wallet integration methods, all converging to an EIP-712 typed-data signature that authenticates API requests:
- MetaMask: Standard browser extension wallet; the user_signature parameter is generated by calling eth_signTypedData_v4 with the message "I want to proceed"
- BioWallet: GenoBank.io's custom wallet extension (Chrome extension ID: eohejafelpaeejocncecjofhjmgflahn); detected via window.ethereum.isBioWallet === true; auto-signs the standard message for frictionless login
- WalletConnect v2: QR-code-based mobile wallet connection supporting all EVM-compatible wallets
- Tangem NFC: Hardware NFC card wallet for patients who prefer physical key custody
- World ID: Privacy-preserving proof of personhood via iris biometric verification (Worldcoin protocol), providing Sybil resistance for population-level research participation
All wallet addresses are stored in checksummed form (EIP-55) throughout the system.
Session credentials are persisted in localStorage (not sessionStorage) to survive page refreshes without requiring re-signature. The environment configuration system (js/env.js) automatically selects staging or production API endpoints based on the URL path prefix, enabling transparent testing without code modification.
5.2 Metamorphic Consent
Traditional informed consent is a static, binary event: a patient signs a form once, and the consent persists until revoked. GenoBank.io's consent model is metamorphic: consent transforms from a static permission grant into an ongoing economic relationship through the combination of BioNFTs, Shapley value attribution, and Biodata Dividends.
The metamorphic consent lifecycle operates as follows:
- Initial Consent Event: Patient mints a BioNFT on Avalanche C-Chain and attaches a BioPIL license specifying permitted uses, price per access, and revenue share percentage.
- Active Research Access: Each time a researcher accesses the patient's data via x402 BioRouter, a micropayment is collected, a usage record is written to the Sequentias ledger, and a revenue share is distributed to the patient's wallet.
- Attribution Accumulation: As the patient's data contributes to AI model training or research publications, Shapley value calculations (computed off-chain and anchored to the Sequentias ledger) quantify the marginal contribution of the patient's specific data to each downstream output.
- Dynamic Terms Update: The patient may update their BioPIL terms at any time — changing the price, restricting permitted uses, or granting exclusive access to a specific researcher — without invalidating the Story Protocol attribution record.
- Consent Revocation: The patient burns the BioPIL token, immediately blocking all new access. Historical usage records and attribution data remain on-chain for audit purposes.
This metamorphic structure means that a patient who consented in 2024 does not simply have a permission form on file — they have an active economic instrument that generates ongoing returns as their data continues to provide value to the research community.
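The lifecycle above can be modelled as a small state machine. The sketch below is an in-memory stand-in written for illustration; the class name `BioPILLicense`, its fields, and its methods are hypothetical and do not reflect the on-chain contract interface.

```python
from dataclasses import dataclass, field


@dataclass
class BioPILLicense:
    """Hypothetical in-memory model of a BioPIL license (not the on-chain contract)."""
    permitted_uses: set
    price_per_access: float
    revenue_share: float  # fraction of each payment routed to the patient
    active: bool = True
    usage_log: list = field(default_factory=list)
    patient_earnings: float = 0.0

    def access(self, researcher: str, use: str) -> float:
        """Record one paid access and return the patient's share of the payment."""
        if not self.active:
            raise PermissionError("license revoked")
        if use not in self.permitted_uses:
            raise PermissionError(f"use not permitted: {use}")
        share = self.price_per_access * self.revenue_share
        self.usage_log.append((researcher, use, self.price_per_access))
        self.patient_earnings += share
        return share

    def update_terms(self, **changes) -> None:
        """Change price, permitted uses, or revenue share; prior usage records are kept."""
        if not self.active:
            raise PermissionError("license revoked")
        for name, value in changes.items():
            if name not in {"permitted_uses", "price_per_access", "revenue_share"}:
                raise ValueError(f"unknown term: {name}")
            setattr(self, name, value)

    def revoke(self) -> None:
        """Model of burning the token: blocks new access, preserves the usage log."""
        self.active = False
```

The design point the sketch tries to capture is that `update_terms` and `revoke` never touch `usage_log`, mirroring the statement that historical usage and attribution records survive both term changes and revocation.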
5.3 x402 BioRouter: Agent-to-Agent Data Commerce
The long-term vision of GenoBank.io is a galaxy of patient-owned AI agents, each carrying its own genomic and clinical dataset, interacting with researcher AI agents through the x402 BioRouter protocol. The economic mechanics of this network are:
A researcher's AI agent, running autonomously as part of a drug discovery pipeline, issues a GET request to the BioRouter with a query BioCID describing the type of data required (e.g., vcf/BRCA1/pathogenic/female/50-65). The BioRouter returns a list of matching patient data assets with their price and license terms. The researcher agent evaluates the terms programmatically against its research protocol's consent requirements, authorizes payment for those assets that match, and receives time-limited access tokens. The entire transaction (discovery, negotiation, payment, access, attribution) is automated, auditable, and patient-controlled.
This is not a hypothesis or a design proposal. The x402 BioRouter protocol is implemented and deployed on GenoBank.io's production infrastructure as of March 2026, with the Sequentias Network (chain ID 15132025) serving as the settlement and audit ledger.
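The discovery, payment, and access steps can be illustrated with a toy, in-memory router. Everything named here (`ToyBioRouter`, `Listing`, `acquire`, the response fields, the one-hour token lifetime) is a hypothetical stand-in chosen for the sketch; only the use of HTTP status 402 to signal that payment is required comes from the text above.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Listing:
    asset_id: str
    biocid: str
    price: float
    permitted_uses: frozenset


class ToyBioRouter:
    """In-memory stand-in for a payment-gated router; all names are hypothetical."""

    def __init__(self, listings):
        self._listings = {item.asset_id: item for item in listings}
        self.ledger = []  # (asset_id, payer, amount) audit records

    def discover(self, biocid_prefix: str):
        return [x for x in self._listings.values() if x.biocid.startswith(biocid_prefix)]

    def request(self, asset_id: str, payer: str, payment: Optional[float] = None):
        """Return (status, body): 402 with terms until sufficient payment is attached."""
        listing = self._listings[asset_id]
        if payment is None or payment < listing.price:
            return 402, {"asset_id": asset_id, "price": listing.price}
        self.ledger.append((asset_id, payer, payment))
        token = {"value": secrets.token_hex(16), "expires_at": time.time() + 3600}
        return 200, {"access_token": token}


def acquire(router, biocid_prefix, payer, required_use, budget):
    """Researcher-agent side: discover, check terms, pay on 402, collect tokens."""
    tokens = {}
    for listing in router.discover(biocid_prefix):
        if required_use not in listing.permitted_uses or listing.price > budget:
            continue  # terms do not match the research protocol, or too expensive
        status, body = router.request(listing.asset_id, payer)
        if status == 402:
            status, body = router.request(listing.asset_id, payer, payment=body["price"])
        if status == 200:
            tokens[listing.asset_id] = body["access_token"]
            budget -= listing.price
    return tokens
```

In this sketch the ledger entry is written only when payment succeeds, so each issued token corresponds to exactly one audit record.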
6. Results
The following results are from validated production testing of the complete GenoClaw pipeline on real patient data (de-identified for this publication) as of March 2026. All figures represent actual measured values, not projections.
6.1 Genomic Processing Performance
| Metric | Value | System | Notes |
|---|---|---|---|
| Variant records in output VCF | 2,870,000 | Clara Parabricks 4.6.0-1 + DeepVariant | WGS input at 30x mean coverage |
| Wall-clock variant calling time | 1 min 39 sec (99 sec) | NVIDIA A100 (Google Compute Engine) | gcsfuse-mounted BAM, no local copy |
| Annotated variants (OpenCRAVAT) | 351 | OpenCRAVAT 2.13.0, 12-annotator panel | Filtered to PASS + ACMG P/LP/VUS |
| Annotation runtime | < 15 min | OpenCRAVAT (cravat.genobank.app) | 12 annotators in parallel |
| GCS storage cost (30x WGS BAM) | ~$0.023 / month | Google Cloud Storage Nearline | ~60 GB compressed BAM |
6.2 Clinical Data Integration
| FHIR Resource Type | Count | Source / Coding |
|---|---|---|
| Total records | 1,463 | UCSF Health (Epic MyChart FHIR R4) |
| Condition (diagnoses) | 42 | ICD-10 coded |
| MedicationRequest | 101 | Active + historical |
| Observation (labs) | 441 | LOINC coded |
| Observation (vitals) | 290 | LOINC coded |
| DiagnosticReport | 67 | Imaging + pathology |
| Immunization | 31 | CVX coded |
| Procedure | 88 | CPT coded |
| AllergyIntolerance | 12 | SNOMED coded |
| Other resource types | 391 | Mixed |
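A per-type breakdown like the one in this table can be produced by tallying the resourceType of each entry in a FHIR Bundle. The sketch below assumes the imported records are available as a Bundle-shaped dictionary and that Observations carry the standard FHIR R4 category coding (for example laboratory or vital-signs); the function name `tally_resources` is introduced here for illustration.

```python
from collections import Counter


def tally_resources(bundle: dict) -> Counter:
    """Count resources in a FHIR Bundle by resourceType.

    Observations are split by their first category code so that labs and
    vitals are reported on separate rows.
    """
    counts = Counter()
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        rtype = resource.get("resourceType", "Unknown")
        if rtype == "Observation":
            codes = [
                coding.get("code")
                for category in resource.get("category", [])
                for coding in category.get("coding", [])
            ]
            label = codes[0] if codes and codes[0] else "uncategorized"
            rtype = f"Observation ({label})"
        counts[rtype] += 1
    return counts
```

Summing the returned counter gives the total record count, which is a quick consistency check against the first row of the table.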
6.3 Agent Capability Demonstration
With the combined clinical-genomic knowledge graph loaded, GenoClaw demonstrated the following agent capabilities in production testing:
- Drug-Gene Interaction Identification: The pharmgx skill identified 3 clinically significant pharmacogenomic interactions between the patient's 101 medication records (active and historical) and variants in CYP2C19, CYP2D6, and DPYD, with CPIC-guideline-level evidence.
- Cancer Risk Assessment: The cancer-risk skill cross-referenced the VCF with OncoKB and CIViC, identifying 2 variants of oncogenic significance in genes relevant to the patient's existing cancer diagnoses.
- Consent State Query: The consent-manager skill returned a complete audit trail of all data access events from the Sequentias ledger in under 200 ms.
- Natural Language Query: The patient question "Are any of my heart medications affected by my DNA?" was correctly resolved by the bio-orchestrator to a combined invocation of pharmgx and clinical context filtering, returning a structured response with gene names, medication names, and recommended dose adjustments in plain language.
The complete pipeline — from BAM upload through variant calling, FHIR import, annotation, and agent-ready knowledge graph construction — was validated end-to-end on production infrastructure in March 2026. Processing time from upload to first agent query: under 30 minutes for whole-genome data.
7. Discussion
7.1 Regulatory Implications
The GDPR and CCPA compliance posture of GenoClaw differs fundamentally from that of conventional health data platforms. Conventional platforms must comply with GDPR because they are data controllers holding patient data without explicit patient authorization at the structural level — consent forms exist as legal documents, but the data resides on the company's infrastructure and the company makes operational decisions about its use.
In GenoClaw's architecture, the patient is the data controller in the GDPR sense: they decide where data is stored (their GCS bucket), who can access it (BioPIL terms), on what conditions (price, permitted uses, duration), and they hold the technical key to revoke access (burning the BioPIL token). GenoBank.io is a software provider, not a data controller. This structural difference significantly reduces GenoBank.io's GDPR compliance burden while increasing the actual privacy protection available to patients.
The HIPAA Safe Harbor de-identification applied before LLM processing ensures that even if a language model provider were subpoenaed or breached, the data they hold contains no PHI — it is legally de-identified under 45 CFR § 164.514(b) and thus not subject to HIPAA's breach notification provisions.
7.1.1 BioNFTs and HIPAA: The Patient Is Not a Covered Entity
HIPAA's Privacy Rule (45 CFR Parts 160 and 164) was designed to regulate "covered entities" — healthcare providers, health plans, and healthcare clearinghouses — that create, receive, maintain, or transmit Protected Health Information (PHI). Critically, patients are not covered entities under HIPAA. A patient who holds their own health data in their own BioNFT-gated vault is not subject to HIPAA's restrictions on data use and disclosure.
This creates a fundamental architectural advantage: when a patient imports their Epic MyChart records into their BioWallet via the FHIR Patient Access API (exercising their Individual Right of Access under 45 CFR § 164.524), the data transitions from HIPAA-regulated space (the hospital's EHR system) to patient-controlled space (the BioNFT-gated GCS bucket). The patient may then share, analyze, license, or monetize their data without the restrictions that would apply to a covered entity.
The BioNFT serves as the cryptographic proof of this transition: the token's ownership record on Avalanche C-Chain establishes that the data is held by a patient wallet (not a covered entity's system), and the BioPIL license terms on Sequentias encode the patient's autonomous decisions about who may access their data and under what conditions.
7.1.2 BioNFTs and GDPR: Data Controller by Design
Under GDPR Article 4(7), a "controller" is the natural or legal person that determines the purposes and means of processing personal data. In traditional health data systems, the hospital or company is the controller, and the patient is the "data subject" — a passive beneficiary of regulatory protections. BioNFTs invert this relationship:
| GDPR Right | Traditional Implementation | BioNFT Implementation |
|---|---|---|
| Art. 15: Right of Access | Patient submits formal request; company has 30 days to respond | Patient reads their own GCS bucket at any time (wallet key = access key) |
| Art. 17: Right to Erasure | Patient submits deletion request; company must verify identity and process | Patient deletes files from their GCS bucket or burns the BioNFT |
| Art. 20: Right to Portability | Company exports data in "commonly used format" within 30 days | Data is already in patient's vault in standard formats (FHIR JSON, VCF) |
| Art. 7(3): Right to Withdraw Consent | Patient contacts company; company updates internal database | Patient burns BioPIL token on Sequentias; access instantly revoked on-chain |
| Art. 25: Data Protection by Design | Company implements technical measures as policy commitment | BioNFT-gated access is the architecture itself — not a policy overlay |
| Art. 30: Records of Processing | Company maintains internal logs; subject to audit | All access events recorded on immutable blockchain; publicly auditable |
The key insight is that BioNFTs do not merely comply with GDPR — they make the regulatory framework structurally unnecessary for data the patient controls. When the patient is simultaneously the data subject AND the data controller, the protective provisions of GDPR that exist to shield subjects from controllers become redundant. The regulations remain applicable to GenoBank.io as a software provider (processor), but the scope of regulated processing is dramatically narrower because the patient — not GenoBank.io — holds and controls the data.
7.1.3 BioNFTs and CCPA/CPRA: Opt-Out by Architecture
The California Consumer Privacy Act (CCPA), as amended by the California Privacy Rights Act (CPRA), grants consumers the right to opt out of the sale of their personal information (Cal. Civ. Code § 1798.120). In the BioNFT model, data is never sold by GenoBank.io because GenoBank.io never possesses or controls the data. When a researcher accesses patient data via x402 BioRouter, the payment flows directly from the researcher's wallet to the patient's wallet — GenoBank.io is not a party to the data transaction. The patient is not "opting out" of data sales; they are the seller, setting their own price via BioPIL terms.
CPRA's category of "Sensitive Personal Information" — which explicitly includes genetic data (Cal. Civ. Code § 1798.140(ae)(4)) — receives the highest protection. In the BioNFT model, sensitive genetic data never resides on any company's infrastructure in identified form. The only entity with access to identified genetic data is the patient themselves, via their BioWallet.
7.1.4 BioNFTs and GINA: Discrimination Protection Through Opacity
The Genetic Information Nondiscrimination Act of 2008 (GINA) prohibits health insurers (Title I) and employers (Title II) from using genetic information in coverage, underwriting, or employment decisions. However, GINA has a fundamental weakness: it assumes that genetic data flows through institutional channels where it could be accessed by insurers or employers.
BioNFTs add a complementary protection layer: because the patient's genomic data is stored in a BioNFT-gated vault accessible only via the patient's wallet key, it is technically impossible for an insurer or employer to access the data without the patient's explicit authorization (in the form of a BioPIL license grant). The Bloom filter mechanism enables variant-level queries without exposing the underlying data, so a patient can participate in research studies that query for specific variants (e.g., "do you carry BRCA1 c.5266dupC?") without revealing their complete genetic profile to anyone.
This creates discrimination protection by design: even if GINA's legal protections were weakened or repealed, the cryptographic access control of the BioNFT vault would still prevent unauthorized access to genetic information.
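The variant-level query described above relies on Bloom filter membership testing. A minimal sketch follows; the source does not specify how the deployed filter derives its hash positions, so the SHA-256 double-hashing scheme, the small default size, and the variant key format used here are all assumptions made for illustration.

```python
import hashlib


class VariantBloomFilter:
    """Minimal Bloom filter over variant keys; parameters here are illustrative."""

    def __init__(self, m_bits: int = 1 << 16, k: int = 7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, key: str):
        # Double hashing: derive k indexes from two 64-bit slices of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps are non-zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means present or a false positive."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

A querying party only ever sees the bit array and the yes/no answer, never the list of variants that was inserted. A positive answer is probabilistic, which is why confirmation against raw data is still needed before any clinical use.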
7.1.5 BioNFTs and the 21st Century Cures Act: Patient Access Fulfilled
The 21st Century Cures Act (2016), together with the ONC Cures Act Final Rule and the CMS Interoperability and Patient Access Final Rule (CMS-9115-F), mandates that patients have electronic access to their health information "without special effort." The FHIR R4 Patient Access API implemented by Epic MyChart is the technical fulfillment of this mandate. GenoClaw's Epic FHIR integration takes this mandate to its logical conclusion: the patient not only accesses their data but also imports it into a sovereign vault where they can analyze it with AI, annotate it with genomic databases, and license it for research.
The BioNFT serves as the patient's receipt of data sovereignty: it proves that the patient exercised their Cures Act right to access their data, imported it into their own infrastructure, and now holds it as a cryptographically verified asset. This is the first system that treats the Cures Act not as a regulatory checkbox (providing an API endpoint) but as a genuine transfer of data ownership from institution to patient.
7.2 Limitations and Current Constraints
The following limitations are acknowledged:
- GPU Instance Availability: Clara Parabricks processing requires an active NVIDIA A100 instance on Google Compute Engine. The current architecture provisions the instance on demand and deprovisions it after job completion, which introduces a 3–5 minute cold-start latency for the first job after inactivity. Warm instance pools are a planned improvement.
- FHIR Provider Coverage: SMART on FHIR Patient Access API support is mandatory for US EHR vendors under the ONC Final Rule, but implementation quality varies significantly. Non-Epic providers may return incomplete resource sets. The current production pipeline is validated for Epic MyChart at UCSF Health.
- LLM Context Window: With 2.87 million variant records and 1,463 clinical records, the full patient knowledge graph cannot fit within any current LLM context window. The bio-orchestrator selects relevant subsets using vector similarity search and structured query pre-filtering before constructing the LLM prompt.
- Bloom Filter False Positive Rate: The production Bloom filter configuration (m = 2^26 bits, k = 7 hash functions) achieves a false positive rate of approximately 0.0008 for a typical variant set of 50,000 common variants. This rate is acceptable for cohort discovery but requires confirmation with raw data access before clinical action.
- x402 BioRouter Liquidity: The agent-to-agent commerce network requires sufficient agent density to function as a marketplace. The network becomes more valuable as more patients deploy GenoClaw instances. Early-stage incentives for patient onboarding are under development.
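For the Bloom filter limitation above, the expected false positive rate can be estimated from the filter size m, the number of hash functions k, and the number of inserted elements n using the standard approximation p = (1 - e^(-kn/m))^k. The helper below is a sketch of that textbook formula, not a measurement of the deployed filter; the function names are introduced here.

```python
import math


def bloom_false_positive_rate(m_bits: int, k: int, n_items: int) -> float:
    """Standard approximation: p = (1 - exp(-k * n / m)) ** k."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is very small.
    return (-math.expm1(-k * n_items / m_bits)) ** k


def optimal_k(m_bits: int, n_items: int) -> float:
    """Hash count that minimizes the approximation above: k = (m / n) * ln 2."""
    return (m_bits / n_items) * math.log(2)
```

The rate is highly sensitive to n. With m = 2^26 and k = 7, this approximation gives roughly 8e-4 at about 4.3 million inserted elements and a value many orders of magnitude smaller at 50,000, so the set size behind a quoted rate is worth confirming against the deployed configuration.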
7.3 Comparison with Competing Approaches
| Feature | GenoClaw | Federated Learning | Traditional EHR AI | Personal Health Record Apps |
|---|---|---|---|---|
| Patient data ownership | Cryptographic (BioNFT) | Company (gradient server) | Company / Health System | Nominal (ToS-dependent) |
| Data quality | Complete, authentic | Degraded (gradient approx.) | Complete (inaccessible) | Variable |
| Patient attribution | On-chain, permanent | None | None | None |
| Revenue sharing | Automatic micropayment | None | None | None |
| Consent revocation | Cryptographic (token burn) | Impossible after incorporation | Legal request only | Account deletion |
| GDPR Art. 17 compliance | Architectural | Structural conflict | Policy + legal | Policy |
| Bankruptcy protection | Patient holds keys | None | None | None |
| AI analysis capability | Full (9 skills) | Aggregate only | Platform-dependent | Basic |
7.4 Future Work
Several extensions to GenoClaw are in active development or planned for the near term:
- Trio Analysis and Family Inheritance: BioPIL #9 (Family Inheritance) enables inherited genomic data rights. A parent who mints a BioNFT for a newborn's genomic data can specify that the BioNFT transfers to the child at age of majority. Clara Parabricks DeepVariant trio mode is available for joint parent-offspring variant calling.
- ICoNS (International Consortium on Newborn Sequencing) Integration: GenoBank.io is building a newborn genomic screening pipeline aligned with the ICoNS gene list framework, incorporating BeginNGS and GUARDIAN study protocols for population-scale neonatal sequencing.[22,23]
- AlphaGenome Regulatory Variant Interpretation: The alphagenome-interpret skill is being integrated with Google DeepMind's AlphaGenome model for sequence-based prediction of regulatory variant effects, particularly splicing and transcription factor binding site disruption.
- PACS / Medical Imaging Integration: GenoClaw's consent and access-control architecture is being extended to DICOM imaging data via a PACS (Picture Archiving and Communication System) integration layer, enabling patients to own and share their radiology and pathology imaging alongside genomic and clinical data.
- Biodata Dividend Calculation: Shapley value computation for attributing model accuracy improvements to individual patient data contributions is an active research area. A production-ready Biodata Dividend calculator is planned for release in Q3 2026.
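To make the Shapley attribution idea concrete, the sketch below computes exact Shapley values for a small set of contributors by averaging each contributor's marginal gain over every ordering. It is a textbook formulation, not the planned Biodata Dividend calculator; the `value` function (for example, model accuracy as a function of which patients' data is included) is a hypothetical input.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values via marginal contributions over all orderings.

    `value` maps a frozenset of players to a number (e.g. model accuracy).
    Cost grows factorially, so this is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}
```

For two contributors where accuracy is 0.6 with the first alone, 0.4 with the second alone, and 0.8 with both, the values come out to 0.5 and 0.3, which sum to the 0.8 achieved jointly. At population scale the exact computation is infeasible, so a production system would presumably rely on sampling-based approximations.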
8. Conclusion
GenoClaw demonstrates that patient ownership of AI health agents is not a theoretical aspiration but a deployable, production-validated architecture. The system processes whole-genome sequencing data in under two minutes, integrates over 1,400 structured clinical records from live EHR systems, annotates variants against twelve oncology knowledge bases, and exposes nine bioinformatics skills through a natural language interface — all while maintaining cryptographic patient data sovereignty, HIPAA-compliant de-identification, and on-chain consent audit trails.
The 23andMe bankruptcy was not an anomaly. It was the natural endpoint of a custodial data model that treats patient genomic information as a corporate asset. Any company that holds patient genomic data as its primary asset is a company whose interests are structurally misaligned with those of its patients, regardless of its privacy policy, its consent forms, or its regulatory compliance posture.
The Decentralized Virtual Bioinformatic Machine architecture eliminates the structural misalignment by ensuring that patient data never becomes a company asset in the first place. The patient's BioWallet is not a privacy policy — it is a cryptographic key. The patient's BioPIL license is not a consent form — it is a smart contract. The patient's Biodata Dividend is not a loyalty program — it is an automated payment from blockchain settlement.
GenoBank.io's vision is a galaxy of patient-owned AI agents, each a complete Decentralized Virtual Bioinformatic Machine, interacting with researchers, clinicians, and AI systems through transparent, auditable, and economically fair protocols. GenoClaw is the first instance of this galaxy. The x402 BioRouter is the infrastructure that will connect them.
Privacy is not about hiding data or making it fuzzy. Privacy is about giving patients complete control over their authentic, high-quality data, with full transparency about its use and fair compensation for its value.
References
seccomp(2) manual page.