# Data Conventions

This document specifies how data is structured, named, and reported across the OSF project *Pharmacokinetics and Neovaginal Microbiome Dynamics of a 28-Day Cyclical Hormone Protocol in a Peritoneal Flap Vaginoplasty: An N=1 Case Study*.

## Scope

This is a self-funded, single-subject (N=1) case study. Data is collected by the investigator using consumer-grade home monitoring tools (Inito fertility monitor, Apple Watch, Clue app, Evvy microbiome kit) supplemented by clinical labs and the investigator's own protocol logging.

The conventions below support a methodologically defensible proof-of-concept dataset. They are not intended to match the rigor of a multi-site clinical trial. The intent is to provide enough structure and documentation that a clinician or researcher could use this dataset to bootstrap a properly designed study, or to evaluate the protocol's effects on this specific subject.

## Purpose

The conventions exist to keep the dataset internally consistent across cycles and across data streams, to make the data legible to outside reviewers and potential collaborators, and to make future cycle uploads a routine procedural task rather than a series of structural decisions.

This document applies to all components and supersedes any prior unstated convention.

## Component Structure

The project is organized into components, each containing a defined slice of the longitudinal record:

- **Baseline**: pre-intervention and pre-first-cyclic-injection data. Spans the washout period and the early days of cycle 1 prior to the first cyclic EV injection. Includes the November 2025 pre-washout steady-state lab draw.
- **Cycle N** (N = 1, 2, 3...): all data collected during the 28 days of cycle N that does not belong in the baseline component (see *Cross-Component Placement*).

Each component contains its own README, data files, and data dictionary additions. Components are not edited retroactively except for corrections (see *Corrections and Versioning*).

## File Naming

All cycle component files are prefixed with the cycle number:

| Pattern | Example |
|---------|---------|
| `cycle{N}_protocol_log.csv` | `cycle1_protocol_log.csv` |
| `cycle{N}_{stream}.csv` | `cycle1_inito.csv`, `cycle2_clue_daily.csv` |
| `cycle{N}_data_dictionary_additions.csv` | `cycle1_data_dictionary_additions.csv` |
| `README.md` | one per component, no prefix |

Baseline files use the `baseline_` prefix instead of `cycle{N}_`.
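
The naming patterns above can be generated and checked mechanically. The following is a minimal sketch, not part of the project tooling: the helper names are hypothetical, and the assumption that stream names are lowercase `snake_case` is inferred from the examples in the table rather than stated by this document.

```python
import re

def component_prefix(cycle=None):
    """Return 'baseline' when no cycle is given, else 'cycle{N}'."""
    if cycle is None:
        return "baseline"
    if not isinstance(cycle, int) or isinstance(cycle, bool) or cycle < 1:
        raise ValueError("cycle must be a positive integer")
    return f"cycle{cycle}"

def stream_filename(stream, cycle=None):
    """Build a data file name such as 'cycle1_inito.csv'."""
    return f"{component_prefix(cycle)}_{stream}.csv"

# Assumption: stream names are lowercase snake_case, as in the examples.
FILENAME_RE = re.compile(r"(baseline|cycle[1-9][0-9]*)_[a-z0-9_]+\.csv")

def is_valid_filename(name):
    """README.md carries no prefix; every other file must match the pattern."""
    return name == "README.md" or FILENAME_RE.fullmatch(name) is not None
```

For example, `stream_filename("clue_daily", 2)` yields `cycle2_clue_daily.csv`, and `stream_filename("serum_labs")` yields the baseline equivalent.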

## Schema Continuity

Every stream file (Inito, Clue daily, wearables, serum labs, Evvy microbiome, Evvy summary) in a cycle component uses the **exact** column set, column order, and column naming as the corresponding baseline file. No new columns, no renamed columns, no reordered columns.

If a measurement type appears in cycle N that was not present in baseline, the new variable is documented in the cycle's data dictionary additions but is not added as a column to a stream file. If structural changes are needed, they go through a deliberate revision process (see *Corrections and Versioning*) and apply to all components going forward.

The protocol log file is the exception. It is unique to cycle components and has no baseline equivalent. Its schema is documented in `cycle1_data_dictionary_additions.csv` and is preserved across cycles.
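
Schema continuity can be checked by comparing header rows before upload. The sketch below is illustrative only: the function names are hypothetical, and the `date` column used in the usage note is an assumed example rather than a documented baseline column.

```python
import csv
import io

def read_header(csv_text):
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))

def schema_problems(baseline_header, cycle_header):
    """Describe how a cycle header deviates from baseline; empty list if identical."""
    if cycle_header == baseline_header:
        return []
    problems = []
    missing = [c for c in baseline_header if c not in cycle_header]
    extra = [c for c in cycle_header if c not in baseline_header]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not missing and not extra:
        # Same names, different sequence.
        problems.append("columns reordered")
    return problems
```

A cycle file whose header is `date,lh_miu_ml,e3g_ng_ml` against a baseline header of `date,e3g_ng_ml,lh_miu_ml` would be reported as reordered, which the convention treats as a violation even though no column was added or renamed.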

## Sources of Truth

When the same value is recorded in more than one source, the following hierarchy applies:

| Data type | Authoritative source | Notes |
|-----------|----------------------|-------|
| Inito readings (E3G, LH, PdG, FSH, BBT) | Inito API, synced into the Cycle Planner MCP (cycle tracker database) | Values originate from the Inito device and are pulled directly from Inito's API. The cycle tracker upserts these values, overwriting any temporary manual entries. See *Inito Data Source*. |
| Apple Watch biometrics (HRV, RHR, skin temp, sleep) | Clue measurements JSON, filtered to `source = "wearable_apple"` | |
| Self-reported symptoms, tags, period, spotting | Clue measurements JSON, filtered to `source != "wearable_apple"` | |
| Serum lab values | Original lab report PDF | |
| Microbiome species, summary scores, STI panel, AMR | Evvy provider report | |
| Protocol administration (planned and actual) | Cycle Planner MCP (cycle tracker database) | Reconciled against memory and supporting evidence |

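The cycle tracker is the operational logging tool. Protocol administration is entered by the investigator at the time of dosing. Inito hormone values are synced from the Inito API; any values typed in manually at the time of testing are transient placeholders that the sync overwrites (see *Inito Data Source*). The cycle tracker exposes its data via an MCP server, which is queried programmatically when building data files.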
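
The two Clue rows in the table differ only in the `source` filter. As a hedged sketch, assuming the Clue measurements JSON has already been parsed into a list of dictionaries that each carry a `source` key (the actual export structure is not described in this document, and the function name is hypothetical), the split might look like this:

```python
WEARABLE_SOURCE = "wearable_apple"

def split_clue_measurements(measurements):
    """Separate Apple Watch biometrics from self-reported entries.

    Assumption: each measurement is a dict with a 'source' key. Records
    without one fall on the self-reported side, because the convention
    defines that side as source != "wearable_apple".
    """
    wearable = [m for m in measurements if m.get("source") == WEARABLE_SOURCE]
    self_reported = [m for m in measurements if m.get("source") != WEARABLE_SOURCE]
    return wearable, self_reported
```

Every record lands in exactly one of the two lists, so no measurement can feed both the wearables file and the Clue daily file.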

## Inito Data Source

Inito readings are pulled directly from Inito's API and synced into the cycle tracker, which is queried via MCP when building data files. The API returns the same values the Inito device produces, at native source precision. There is no transcription step.

The cycle tracker may temporarily contain values manually entered by the investigator before a sync has been run (for example, to avoid logging into the Inito app on a test morning). These manual entries are transient placeholders; running the API sync upserts the authoritative values from Inito and overwrites any manual entry. Data files for publication are always built after sync, so all published Inito values are API-sourced.

### API field mappings

The Inito API returns several fields per test. The dataset captures a subset relevant to the strip type used in this study:

| API field | Dataset column | Notes |
|-----------|----------------|-------|
| `e3g_value` | `e3g_ng_ml` | Direct mapping (estrone-3-glucuronide) |
| `beta_lh_value` | `lh_miu_ml` | Subject uses Beta LH-only Inito strips. `alpha_lh_value` is always null for this strip type. The general-purpose `lh_value` field mirrors `beta_lh_value` for Beta LH-only strips, but `beta_lh_value` is mapped explicitly to avoid ambiguity. |
| `pdg_value` | `pdg_ug_ml` | Direct mapping (pregnanediol glucuronide) |
| `ifsh_value` | `fsh_miu_ml` | The Inito app displays `ifsh_value` (intact FSH) as "FSH". The general-purpose `fsh_value` field is null for this strip type. |

Fields not captured: `hcg_value` (no pregnancy testing in this protocol), `alpha_lh_value` (not measured by Beta LH-only strips).
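
The mapping table translates directly into a lookup. The sketch below assumes one API test record arrives as a dictionary keyed by the field names above. The record shape and the function name are assumptions, and this is not the cycle tracker's actual sync code.

```python
# API field -> dataset column, as listed in the mapping table.
API_TO_DATASET = {
    "e3g_value": "e3g_ng_ml",
    "beta_lh_value": "lh_miu_ml",
    "pdg_value": "pdg_ug_ml",
    "ifsh_value": "fsh_miu_ml",
}

def map_inito_record(api_record):
    """Keep only the captured fields, renamed to dataset columns.

    Values pass through unchanged; a null (None) becomes a blank cell.
    Fields such as hcg_value, alpha_lh_value, lh_value, and fsh_value
    are ignored because they are absent from the mapping.
    """
    row = {}
    for api_field, column in API_TO_DATASET.items():
        value = api_record.get(api_field)
        row[column] = "" if value is None else value
    return row
```

Mapping `beta_lh_value` and `ifsh_value` by name, rather than the general-purpose `lh_value` and `fsh_value`, keeps the output correct even if the general-purpose fields behave differently for another strip type.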

### Precision

The API returns values at native device precision. Precision varies by reading (some 1 decimal, some 2). The dataset preserves whatever precision the API returns. Trailing zeros are not added.

No LOD floor is applied. Inito reports values below 0.1 (e.g., LH 0.05, FSH 0.07) and these are preserved as recorded.
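
One way to honor this rule in code is to avoid binary floats when parsing the API response, so that the digits written to the CSV are the digits that were received. The sketch below assumes the API payload is JSON text (its real shape is not documented here) and uses the standard library's `parse_float` hook. The function names are hypothetical.

```python
import json
from decimal import Decimal

def parse_preserving_precision(payload_text):
    """Parse JSON so each decimal number keeps its digits exactly as sent."""
    return json.loads(payload_text, parse_float=Decimal)

def format_value(value):
    """Render a parsed value for a CSV cell without rounding, padding, or flooring."""
    if value is None:
        return ""
    if isinstance(value, Decimal):
        # Fixed-point notation; no trailing zeros are added or removed.
        return format(value, "f")
    return str(value)
```

With this approach a sub-0.1 reading such as LH 0.05 is written as `0.05`; nothing clamps it to a floor or pads it to a fixed number of decimals.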

## Formatting

| Element | Convention |
|---------|------------|
| Encoding | UTF-8 |
| Dates | `YYYY-MM-DD` |
| Times | `HH:MM` (24-hour, local time) |
| Phase names | `snake_case`: `early_follicular`, `mid_follicular`, `ovulatory`, `early_luteal`, `mid_luteal`, `late_luteal`, `withdrawal` |
| Multi-value cells | semicolon-separated, no spaces (e.g. `tampon;pad`) |
| Empty cells | blank; no `NA`, `null`, `none`, or placeholder values |
| Numeric precision | Hormone values preserved at the precision returned by the Inito API (mirrors the Inito app display). Precision varies by reading (some 1 decimal, some 2). Trailing zeros are not added. No LOD floor. See *Inito Data Source*. |
| Line endings | match the baseline file for the corresponding stream |
| Em dashes | not used anywhere in the dataset, including notes and README text |
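
Several of these conventions lend themselves to automated checks before upload. The following is a minimal sketch with hypothetical function names. Treating placeholder matching as case-insensitive is an assumption beyond what the table states.

```python
import re
from datetime import datetime

PHASE_NAMES = {
    "early_follicular", "mid_follicular", "ovulatory",
    "early_luteal", "mid_luteal", "late_luteal", "withdrawal",
}
PLACEHOLDERS = {"na", "null", "none"}  # compared case-insensitively (assumption)
EM_DASH = "\u2014"
TIME_RE = re.compile(r"(?:[01][0-9]|2[0-3]):[0-5][0-9]")

def is_valid_date(text):
    """True only for a real calendar date written as zero-padded YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime tolerates '2026-4-5'; the round trip rejects it.
    return parsed.strftime("%Y-%m-%d") == text

def is_valid_time(text):
    """True for 24-hour HH:MM."""
    return TIME_RE.fullmatch(text) is not None

def is_valid_multi_value(text):
    """True for a blank cell or semicolon-separated items with no padding."""
    if text == "":
        return True
    return all(item != "" and item == item.strip() for item in text.split(";"))

def cell_problems(text):
    """List convention violations that apply to any cell."""
    problems = []
    if text.strip().lower() in PLACEHOLDERS:
        problems.append("placeholder value")
    if EM_DASH in text:
        problems.append("em dash")
    return problems
```

A cell of `tampon;pad` passes `is_valid_multi_value`, while `tampon; pad` does not.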

## Notes Columns

Notes columns in data files contain factual events only. Specifically:

- Day-count chronology relative to injections (e.g., "Day 2 post second injection")
- Same-day administration timing (e.g., "Vaginal estradiol cream applied evening post-FMU")
- Cross-references to data in other files (e.g., "See `baseline_serum_labs.csv`")

Notes columns do not contain:

- Mechanism claims, hypotheses, or interpretive framing
- Phrases like "may include", "consistent with", "near peak", or similar inferences
- Comparisons to other readings or population reference ranges

Interpretive content belongs in the component README's narrative sections or in separate analysis documents, not in the data files. Baseline files predate this convention and contain some interpretive notes. Cycle 1 onward conforms.
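
A simple phrase check can catch the explicitly listed wording before a file is published. This sketch is deliberately narrow: it flags only the three phrases named above and cannot detect "similar inferences" or comparisons, which still require manual review. The function name is hypothetical.

```python
# Only the phrases named in this section; the list is not exhaustive.
FLAGGED_PHRASES = ("may include", "consistent with", "near peak")

def flagged_phrases(note):
    """Return the listed interpretive phrases found in a notes cell."""
    lowered = note.lower()
    return [phrase for phrase in FLAGGED_PHRASES if phrase in lowered]
```

A note such as "Day 2 post second injection" returns an empty list; a note containing "consistent with" is returned for rewording or relocation to the README.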

## Row Inclusion Rules

Different streams use different row inclusion policies based on what makes the resulting file most useful:

| File | Policy |
|------|--------|
| `*_protocol_log.csv` | One row for every cycle day, CD1 through CD28. Actuals blank where not logged. Do not infer actuals from intent. |
| `*_inito.csv` | Only days with readings. Skipped days are not represented as blank rows. |
| `*_clue_daily.csv` | Only dates with at least one self-reported entry. All-empty dates are dropped. |
| `*_wearables.csv` | Every date in the cycle range. Cells blank where Apple Watch did not record. |
| `*_serum_labs.csv` | Only days with draws collected. |
| `*_evvy_*.csv` | Only collection dates. |
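
The two policy families, dense (one row per date, blanks allowed) and sparse (rows only where data exists), can be sketched as follows. The function names, the `date` column, and the column names in the usage note are hypothetical. The cycle length of 28 days comes from the protocol.

```python
from datetime import date, timedelta

CYCLE_LENGTH = 28

def cycle_dates(cd1):
    """Return the 28 calendar dates of a cycle, starting at CD1."""
    return [cd1 + timedelta(days=offset) for offset in range(CYCLE_LENGTH)]

def dense_rows(cd1, readings, columns):
    """One row per cycle date; cells stay blank where nothing was recorded.

    readings maps a date to a dict of column -> value.
    """
    rows = []
    for day in cycle_dates(cd1):
        recorded = readings.get(day, {})
        row = {"date": day.isoformat()}
        for column in columns:
            row[column] = recorded.get(column, "")
        rows.append(row)
    return rows

def sparse_rows(readings, columns):
    """Rows only for dates that have at least one non-blank value."""
    rows = []
    for day in sorted(readings):
        values = {column: readings[day].get(column, "") for column in columns}
        if any(value != "" for value in values.values()):
            rows.append({"date": day.isoformat(), **values})
    return rows
```

`dense_rows` corresponds to the wearables and protocol log policies, and `sparse_rows` to the Inito, Clue daily, serum lab, and Evvy policies. Neither function fills in a value that was not recorded.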

## Cross-Component Placement

Most data collected during cycle N belongs in the cycle N component. Two exceptions:

1. **Pre-first-cyclic-injection data**: any cycle 1 data point collected before the first cyclic EV injection (April 5, 2026) belongs in the baseline component. This includes CD1-CD5 Inito readings, the CD2 trough serum draw, and the CD2 Evvy swab.
2. **Washout-endpoint or baseline-reference measurements**: a measurement collected during cycle N but functioning to characterize a pre-cyclic state (e.g., a trough draw before a cyclic injection has been administered) belongs in the baseline component.

When data is placed in baseline that a reader might expect to find in the cycle component, the cycle README's *Files Not Included* section names the data, points to the baseline file, and gives the rationale.
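
The placement rules reduce to a small decision function. This sketch works at date granularity only, so a measurement taken on the injection date itself but before the injection would need the explicit flag. The function and parameter names are hypothetical. The injection date is the one given above.

```python
from datetime import date

FIRST_CYCLIC_INJECTION = date(2026, 4, 5)

def component_for(collected_on, cycle, characterizes_pre_cyclic_state=False):
    """Return the component name a measurement belongs in.

    characterizes_pre_cyclic_state is a manual judgment (exception 2);
    it is never inferred from the data.
    """
    if characterizes_pre_cyclic_state:
        return "baseline"
    if cycle == 1 and collected_on < FIRST_CYCLIC_INJECTION:
        # Exception 1: cycle 1 data before the first cyclic EV injection.
        return "baseline"
    return f"cycle{cycle}"
```

Any measurement routed to baseline by this logic while falling inside a cycle's date range is a candidate for the cycle README's *Files Not Included* section.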

## Data Dictionary

Each cycle component contributes a `cycle{N}_data_dictionary_additions.csv` file. This file contains:

- Full entries for any new variables introduced in this cycle (e.g., protocol log columns in cycle 1)
- A single `_schema_note` entry per stream file whose schema matches baseline, pointing readers to baseline for column definitions

Entries are not duplicated across cycles. If a variable was documented in cycle 1, cycle 2 does not redocument it.

The schema for the data dictionary itself is `file, variable, description, units, source, collection_method, notes`.
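
A data dictionary additions file with this schema can be produced with the standard `csv` module. In the sketch below the function names are hypothetical, and the wording of the `_schema_note` description and the choice to put it in the `description` column are assumptions, since this document does not prescribe them.

```python
import csv
import io

DICTIONARY_COLUMNS = [
    "file", "variable", "description", "units",
    "source", "collection_method", "notes",
]

def schema_note_row(cycle, stream):
    """Build the single _schema_note entry for a stream whose schema matches baseline."""
    row = dict.fromkeys(DICTIONARY_COLUMNS, "")
    row["file"] = f"cycle{cycle}_{stream}.csv"
    row["variable"] = "_schema_note"
    # Assumed wording; adjust to the project's preferred phrasing.
    row["description"] = (
        f"Schema matches baseline_{stream}.csv; "
        "see the baseline data dictionary for column definitions."
    )
    return row

def render_dictionary(rows):
    """Serialize dictionary rows to CSV text with the fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DICTIONARY_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The `lineterminator` here is illustrative; the *Formatting* section's line-ending rule for stream files should guide the final choice.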

## README Structure

Each component README follows the same outline:

- Overview (scope, date range, what is covered)
- Subject Characteristics (baseline component only) or Protocol Summary (cycle components)
- Files (table of included files with descriptions and sources)
- Files Not Included (rationale for any expected file that is absent)
- Data Quality Notes
  - Sources of truth
  - Logging gaps
  - Protocol deviations (cycle components)
  - Other notes (artifacts, cross-references, edge cases)
- File Conventions (date format, encoding, etc.)

The README is allowed narrative latitude that the data files are not. Interpretation, hypotheses, and contextual framing belong here, not in notes columns.

## Discrepancy Reconciliation

When values disagree between sources, the procedure is:

1. Apply the *Sources of Truth* hierarchy. The upstream source wins.
2. Correct the downstream source (typically the cycle tracker) to match.
3. Document the discrepancy in the relevant README's data quality notes only if the resolution affects published values or if the pattern is methodologically informative.

The dataset records reconciled values. It does not preserve a record of every transcription error caught in pre-publication review.

## Corrections and Versioning

Once a component is uploaded to OSF, corrections follow this procedure:

1. **Substantive corrections** (a value is wrong, a row is missing, a deviation was misclassified): edit the file, note the correction in the component README under a *Corrections* subsection with the date and a brief description, and re-upload.
2. **Cosmetic corrections** (typos in narrative text, formatting fixes): edit and re-upload without README annotation.
3. **Schema changes** (column added, renamed, or restructured): apply to all components going forward and document in this conventions file. Do not retroactively reshape historical components without explicit notation.

## Document Maintenance

This conventions document is updated when a new convention is needed or an existing one is revised. Changes are reflected in all subsequent component builds. When this document is updated, the change is described briefly in a *Changelog* section at the bottom of the file.
