AFWERX Lineage Pipeline Readiness Brief

Executive Summary

The Bayes-Council chartered a concentrated upgrade sprint on the AFWERX “actual” lineage pipelines. The goal: ingest and audit 20,058 real awards with zero simulated-data dependencies while surfacing every SR3 cue, GAO data-quality warning, and policy artifact analysts need. We now have mirrored Python and R pipelines that emit identical schemas (including raw descriptions), verified edge-case coverage, and fresh parquet/CSV outputs for the entire award set.

Key Wins

Focus	Outcome	Source of Truth
Metadata SR3 ingestion	`metadata_sr_code`, `metadata_phase_level`, and `lineage_flag` persist across Python/R plus review CSVs	`pipeline/actual/python/fpds_lineage_pipeline.py`, `pipeline/actual/R/fpds_lineage_pipeline.R`
GAO data-quality tagging	`source_data_quality` derives from `pipeline/actual/data/gao_data_quality.csv` and threads into all dashboards	Same as above + config wiring
Prior SBIR references	`prior_sbir_reference` + `reference_source` columns document declared vs. inferred PIIDs	Predictions parquet
Description fidelity	`description_raw` now bundled into lineage predictions for both languages	Latest full-data runs
Edge-case regex tests	`pipeline/actual/R/regex_edge_case_tests.R` fabricates AFWERX-shaped parquet, exercises actual pipeline, and asserts every cue family + metadata field	Regression suite
Full-data refresh	Both pipelines re-generated outputs on 20,058 awards (Python anomalies: 19,961, R anomalies: 20,044)	`pipeline/actual/output/` + logs

Bayes-Council Guidance

Jan Goyvaerts (Regex Doctrine): Demanded deterministic cue coverage. Outcome: expanded regex packs (phase, topic, PIID, policy, platform, collocation, phase history, contract refs) plus the enhanced edge-case test harness to prevent regression drift.
Prof. Claire Cardie (IE/NLP): Directed us to preserve raw descriptions and semantic hooks. We now carry description_raw, textual_reference, and semantic match slots through both outputs for downstream explainability.
Dr. Colin Raffel (LLM Systems): Advocated for SR3 lineage flags + metadata conflicts so model-driven reviewers can compare deterministic metadata vs inferred cues. The pipelines now emit lineage_flag states (metadata_phase3, inferred_phase3_high, etc.) and log any SR3/policy disagreements.
John Williams (Policy): Required GAO verification tags and prior SBIR references for compliance readiness. Implemented via the GAO CSV lookup and prior_sbir_reference column, ensuring each award exposes its agency’s FY23 data-quality status.
Bryant Helms (Transition Ops): Pushed for review-surface readiness. Outputs now include transition summaries, vendor anomalies, and review summaries aligned to the metadata fields analysts will inspect.
Kevin Brancato (Data Quality): Enforced real-data smoke tests (10k, 15k, full) plus regression thresholds (precision/recall ≥0.6) so we can cite evidence before stakeholder pilots.

Implementation Highlights

SR3 / Phase Lineage: Normalized code_research into SR/ST codes, mapped to phase labels, and derived metadata_phase3_flag. Each record gets a lineage_flag plus conflict diagnostics in policy_json.
GAO Quality Map: pipeline/actual/data/gao_data_quality.csv feeds a lookup that tags each award’s source_data_quality (“verified_fy23 – Submitted SBIR Phase III compliance package FY23”, etc.). The tags show up in predictions and transition summaries.
Policy & Compliance Payloads: policy_json now includes SR codes, competition flags, and metadata conflicts so analysts know why a record escalated.
Description Persistence: Added description_raw to both predictions parquet files for transparency and to support future semantic matching.
Config-driven Edge Harness: pipeline/actual/R/regex_edge_case_tests.R now sources the actual pipeline, fabricates a mini-parquet of legacy edge cases, runs the config-aware pipeline, and asserts: cue coverage, metadata columns, GAO tags, and metrics shape.
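
The GAO quality map above amounts to a keyed lookup with an explicit fallback. The following is a minimal sketch, not the pipeline's code: the column names `agency` and `quality_tag`, the sample rows, the `needs_follow_up` tag value, and the helper names are assumptions for illustration, since this brief does not show the header of gao_data_quality.csv.

```python
import csv
import io

# Hypothetical CSV layout; the real columns of gao_data_quality.csv may differ.
SAMPLE_GAO_CSV = """agency,quality_tag
AGENCY_A,verified_fy23
AGENCY_B,needs_follow_up
"""

def load_gao_lookup(handle):
    """Build an agency -> tag dict, normalizing keys to upper case."""
    reader = csv.DictReader(handle)
    return {row["agency"].strip().upper(): row["quality_tag"].strip() for row in reader}

def tag_award(award, lookup, default="unknown"):
    """Return a copy of the award with source_data_quality attached."""
    key = (award.get("agency") or "").strip().upper()
    return {**award, "source_data_quality": lookup.get(key, default)}

gao_lookup = load_gao_lookup(io.StringIO(SAMPLE_GAO_CSV))
tagged = [
    tag_award(award, gao_lookup)
    for award in (
        {"piid": "EXAMPLE-1", "agency": "agency_a"},
        {"piid": "EXAMPLE-2", "agency": None},  # no agency: falls back to "unknown"
    )
]
```

Keeping the fallback explicit means awards with no matching agency still carry a tag, in line with the "Unknown" bucket reported in the run results.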

Linkage Methods (Visual Map)

Method	Signal Type	Source Fields	Usage
SR3 Metadata Flagging	Deterministic	`code_research` → `metadata_sr_code`	Sets `lineage_flag="metadata_phase3"`, drives policy audits
Textual Regex Cues	Deterministic	`description_raw`, structured pairs	Phase/Topic/PIID/Policy/Platform/Collocation/Phase-history/Contract references
Semantic Topic Matcher	Probabilistic assist	`analysis_text`	Adds fuzzy topic hits when textual tokens are absent
Fuzzy PIID Reference	Probabilistic assist	Regex PIIDs + `fuzzy_reference()`	Links orphan awards to prior PIIDs for `prior_sbir_reference`
Vendor History & Baselines	Contextual	Aggregated award/vendor stats	Feeds review escalation (“SBIR vendor silent”) + anomaly logs
GAO Data-Quality Overlay	Risk tagging	`source_data_quality` lookup	Colors transition summaries & analyst dashboards
Policy Payload	Compliance	`policy_json` fields	Captures SR3/competition conflicts, required artifacts, escalation hints

METADATA (SR3/ST codes) ─┐
                         ├─► LINEAGE FLAG (metadata vs inferred)
TEXT REGEX (phase/topic) ─┘
        │
        ▼
DETERMINISTIC CUES ──► EVIDENCE JSON ──► REVIEW STATUS
        │                            \
        │                             └► POLICY PAYLOAD (conflicts, checklists)
        ▼
SEMANTIC + FUZZY MATCHERS ──► PRIOR SBIR REFERENCE
        │
        ▼
GAO TAGS + TRANSITION FEATURES ──► OUTREACH PRIORITIZATION
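
The fuzzy PIID step in the map can be illustrated with the standard library's `difflib`. This is a sketch under assumptions, not the pipeline's `fuzzy_reference()`: the helper names, the 0.85 cutoff, and the `declared`/`inferred` values for `reference_source` are hypothetical.

```python
import difflib
import re

def normalize_piid(raw):
    """Upper-case the PIID and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())

def closest_prior_piid(candidate, known_piids, cutoff=0.85):
    """Return (piid, reference_source), or (None, None) if nothing is close enough."""
    norm_known = {normalize_piid(p): p for p in known_piids}
    norm_candidate = normalize_piid(candidate)
    if norm_candidate in norm_known:
        # Exact match after normalization: treat as a declared reference.
        return norm_known[norm_candidate], "declared"
    hits = difflib.get_close_matches(norm_candidate, list(norm_known), n=1, cutoff=cutoff)
    if hits:
        # Near match only: record it as inferred so reviewers can verify it.
        return norm_known[hits[0]], "inferred"
    return None, None

known = ["FA8650-23-C-5012", "FA8650-20-C-1234"]
exact = closest_prior_piid("fa8650 23 c 5012", known)
near = closest_prior_piid("FA8650-23-C-501Z", known)
miss = closest_prior_piid("N00014-19-C-0001", known)
```

Separating exact from near matches keeps the probabilistic assist visible in the output instead of blending it with declared references.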

How it fits together

Metadata first, text second: SR3/ST codes immediately yield deterministic lineage when present; regex cues and semantic assists fill the gap when metadata is silent or inconsistent.
Evidence serialization: Every cue (phase/topic/PIID/policy/platform/collocation/phase-history/contract) is logged into evidence_json, so analysts can inspect why a record escalated.
Conflict logging: policy_json pairs metadata with textual evidence to highlight SR3 vs. competition-field mismatches before FPDS submission or report-out.
Risk overlays: source_data_quality, transition_priority, and vendor anomalies give outreach teams an immediate/high/monitor triage path without re-running analytics.
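
The "metadata first, text second" rule and the evidence serialization can be sketched as one small decision function. Only `metadata_phase3` and `inferred_phase3_high` are flag names taken from this brief; `inferred_weak`, `no_lineage`, the `high_threshold` parameter, and the shape of each cue hit are assumptions made for illustration.

```python
import json

def assign_lineage_flag(metadata_sr_code, cue_hits, high_threshold=2):
    """Prefer deterministic metadata; fall back to textual cues."""
    if metadata_sr_code in {"SR3", "ST3"}:
        return "metadata_phase3"
    phase3_hits = [
        h for h in cue_hits
        if h["family"] == "phase" and h["value"] in {"III", "3"}
    ]
    if phase3_hits and len(cue_hits) >= high_threshold:
        return "inferred_phase3_high"
    if cue_hits:
        return "inferred_weak"  # hypothetical label
    return "no_lineage"         # hypothetical label

def build_record(piid, metadata_sr_code, cue_hits):
    """Attach the flag and serialize every cue so reviewers can audit it."""
    return {
        "piid": piid,
        "lineage_flag": assign_lineage_flag(metadata_sr_code, cue_hits),
        "evidence_json": json.dumps(cue_hits, sort_keys=True),
    }

hits = [
    {"family": "phase", "value": "III"},
    {"family": "policy", "value": "SBIR DATA RIGHTS"},
]
from_metadata = build_record("EXAMPLE-1", "SR3", [])
from_text = build_record("EXAMPLE-2", None, hits)
silent = build_record("EXAMPLE-3", None, [])
```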

Tagging Methodology (Python + R)

FPDS Research Codes → SR/ST Map → Phase Level
        │                │             │
        ▼                ▼             ▼
   metadata_sr_code ─► metadata_phase_level ─► metadata_phase3_flag
        │                                     │
        ├─ SR3/ST3 present → lineage_flag = metadata_phase3
        └─ SR/ST missing  → textual + semantic cues set inferred flags

TEXT + SEMANTICS
- Regex families: phase, topic, PIID, policy, platform, collocation,
                  phase history, contract refs
- Semantic topic matcher fills gaps
- Fuzzy PIID linking builds `prior_sbir_reference`

        │
        ▼
EVIDENCE JSON  ──►  REVIEW STATUS / LINEAGE CLASS
POLICY JSON    ──►  Escalation hints + compliance checklists

RISK TAGS
- GAO lookup → `source_data_quality`
- Transition features → `days_since_last_phase`, `transition_priority`
- Vendor baselines → anomaly flags
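
The first row of the diagram (research code to SR/ST code to phase level to Phase III flag) can be sketched as follows. The regex, the tolerated input variants, and the helper names are assumptions; the brief does not show how raw `code_research` values are formatted.

```python
import re

# Assumed raw formats: "SR3", "sr3", "ST-2", "SR 1", optionally followed by text.
SR_CODE_RE = re.compile(r"\b(S[RT])\s*-?\s*([123])\b")
PHASE_LABELS = {"1": "Phase I", "2": "Phase II", "3": "Phase III"}

def _empty_metadata():
    return {
        "metadata_sr_code": None,
        "metadata_phase_level": None,
        "metadata_phase3_flag": False,
    }

def normalize_research_code(code_research):
    """Map a raw research code to SR/ST code, phase label, and Phase III flag."""
    if not code_research:
        return _empty_metadata()
    match = SR_CODE_RE.search(code_research.upper())
    if match is None:
        return _empty_metadata()
    program, level = match.groups()
    return {
        "metadata_sr_code": f"{program}{level}",
        "metadata_phase_level": PHASE_LABELS[level],
        "metadata_phase3_flag": level == "3",
    }

sr3 = normalize_research_code("sr3")
st2 = normalize_research_code("ST-2")
missing = normalize_research_code("")
```

When the code is missing or unparseable, every field stays empty so the textual and semantic cues downstream can set an inferred flag instead.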

Cue families (what they detect)

Cue	Sample Logic	Purpose
Phase	Regexes for `PHASE\s+(III|II|I|3|2|1)` plus variants (`PH-III`, `PHASE 2.5`)	Deterministic lineage when text explicitly cites SBIR phase history
Topic	Patterns like `(AF|N|A|SOCOM)\d{2,3}[A-Z]-\d{3}`	Captures SBIR/STTR topic numbers when agency metadata is missing
PIID	Normalizes free-form PIIDs (`FA8650-23-C-5012`, etc.)	Links awards back to prior phases; feeds `prior_sbir_reference`
Policy	Keywords (`SBIR DATA RIGHTS`, `DD2579`, `PHASE III SOLE SOURCE`, etc.)	Signals contractual/legal hooks that imply lineage even without explicit phase text
Platform	Codes such as `F-35`, `C2ISR`, `SATCOM`, `CYBER`	Adds mission context; helps reviewers tie lineage to specific programs
Collocation	Phrases like `PREVIOUS SBIR`, `FOLLOW-ON SBIR`, `TRANSITION`	Broad net that flags narrative references to earlier phases
Phase history	Regex for `PHASE II` within proximity to `SBIR/award/contract`	Confirms continuation when descriptions mention prior phase work
Contract refs	Patterns like `CONTRACT NO. FA8650-20-C-1234`, `MOD P0001`	Documents explicit contractual lineage and helps infer references when SR3 is absent

In both Python and R, these cues are scored deterministically; their presence pushes lineage_score above threshold when metadata is silent, and every match is serialized into evidence_json for auditability.
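
To make the deterministic scoring concrete, here is a minimal sketch covering four of the cue families. The phase and topic patterns follow the table, with word boundaries added; the weights, the one-count-per-family scoring rule, and the function names are assumptions, and the sample description is invented to match the patterns as written.

```python
import json
import re

CUE_PATTERNS = {
    "phase": re.compile(r"\bPHASE\s+(III|II|I|3|2|1)\b"),
    "topic": re.compile(r"\b(?:AF|N|A|SOCOM)\d{2,3}[A-Z]-\d{3}\b"),
    "policy": re.compile(r"\b(?:SBIR DATA RIGHTS|DD2579|PHASE III SOLE SOURCE)\b"),
    "collocation": re.compile(r"\b(?:PREVIOUS SBIR|FOLLOW-ON SBIR)\b"),
}
# Hypothetical weights; the pipeline's actual scoring is not shown in this brief.
CUE_WEIGHTS = {"phase": 0.5, "topic": 0.3, "policy": 0.3, "collocation": 0.2}

def extract_cues(description_raw):
    """Return one dict per regex match, in family order."""
    text = (description_raw or "").upper()
    hits = []
    for family, pattern in CUE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append({"family": family, "match": match.group(0), "start": match.start()})
    return hits

def lineage_score(hits):
    """Sum the weight of each family that fired at least once."""
    families = {h["family"] for h in hits}
    return round(sum(CUE_WEIGHTS[f] for f in families), 2)

description = "Follow-on SBIR effort; Phase II work under topic AF20B-123 with SBIR data rights."
cue_hits = extract_cues(description)
score = lineage_score(cue_hits)
evidence_json = json.dumps(cue_hits)
```

Scoring by family rather than by raw hit count keeps a description that repeats "Phase II" ten times from outscoring one that cites a phase, a topic number, and a policy hook.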

Regex Cue Performance (Visuals)

What the visuals show

Deterministic phase cues dominate. More than 3,200 phase hits (spanning 2,866 awards) indicate that most textual lineage still comes from explicit “Phase II/III” statements.
Policy and platform cues add depth. ≈2,000 policy mentions and ≈1,000 platform callouts strengthen compliance payloads and mission tagging when SR3 metadata is absent.
PIID references are rare but high value. Only 98 awards surfaced textual PIIDs; the fuzzy PIID linker plus prior_sbir_reference capture those chains for future audits.
Collocation + phase-history nets catch the long tail. 1,900+ collocation hits and ~1,000 phase-history hits ensure “previous SBIR” phrases or legacy phase mentions aren’t dropped even in messy descriptions.

Together these cues explain why deterministic precision stays high: metadata flags catch most SR3 cases, and the regex matrix backs them up when text is the only evidence.

Actual Run Results (20,058 AFWERX awards)

Metric	Value
Total awards processed	20,058
Metadata SR3 present	18,193 (90.7%)
Lineage classes	No Lineage: 17,059 · Probable: 1,993 · Weak: 985 · Phase III High Confidence: 21
Review status distribution	Escalate – SBIR vendor silent: 12,191 · Analyst Review: 2,978 · No Action: 4,868 · Auto-accept: 21
Transition priority	Monitor: 18,044 · High: 1,993 · Immediate: 21
GAO data-quality tags	Verified FY23 (compliance package): 9,908 · Unknown: 4,353 · Verified FY23 (cycle completed): 2,965 · Needs follow-up: 2,644 · Partial verification: 188
Anomalies recorded	Python: 19,961 rows · R: 20,044 rows (primarily missing descriptions or orphaned phase indicators)

Interpretation

Lineage lift: 21 records land in fully deterministic “Phase III (High Confidence)” status; another 1,993 are “Probable” with deterministic cues. Weak signals (985) are retained for analyst exploration.
Review workload: 12,191 awards (about 61%) require escalation because vendors are known SBIR performers but textual cues weren’t deterministic. 2,978 are queued for analyst review due to probabilistic cues; only 21 can be auto-accepted.
Metadata fidelity: 90%+ of awards preserved the SR3 flag directly from FPDS metadata; the remainder rely on inferred lineage plus the new conflict logging to surface discrepancies.
GAO risk overlay: About 64% of awards (9,908 + 2,965 = 12,873 of 20,058) come from agencies with verified FY23 data-quality work. 2,644 awards are tied to agencies flagged for incomplete corrective actions, so the dashboards highlight these for manual confirmation before outreach.
Transition posture: 1,993 awards fall into “High” transition priority (recent Phase II/III lineage plus on-time metadata). Only 21 surface as “Immediate,” reflecting the scarcity of fully documented Phase III awards in the raw FPDS feed.
Diagnostics: The anomaly CSVs document missing descriptions and phase references so analysts know where to backfill narratives; vendor anomaly logs capture obligation spikes (≥50% over vendor baseline) for follow-up.
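
The obligation-spike rule in the diagnostics can be sketched as below. The brief states only the threshold (≥50% over vendor baseline); using the median of a vendor's prior obligations as the baseline, the field names, and the assumption that awards arrive in date order are all illustrative choices.

```python
from statistics import median

def flag_obligation_spikes(awards, spike_ratio=0.5):
    """Flag awards at least (1 + spike_ratio) times the vendor's prior median."""
    history = {}
    flagged = []
    for award in awards:  # assumed to be sorted by award date
        prior = history.setdefault(award["vendor"], [])
        if prior:
            baseline = median(prior)
            if baseline > 0 and award["obligation"] >= baseline * (1 + spike_ratio):
                flagged.append({**award, "baseline": baseline})
        prior.append(award["obligation"])
    return flagged

sample_awards = [
    {"vendor": "V1", "obligation": 100.0},
    {"vendor": "V1", "obligation": 110.0},
    {"vendor": "V2", "obligation": 200.0},
    {"vendor": "V1", "obligation": 170.0},  # baseline 105.0, threshold 157.5
    {"vendor": "V2", "obligation": 250.0},  # baseline 200.0, threshold 300.0
]
spikes = flag_obligation_spikes(sample_awards)
```

A vendor's first award is never flagged here because there is no history to compare against.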

Testing & Evidence

Test	Scope	Result
`python pipeline/actual/python/fpds_lineage_pipeline.py --limit 10000`	Real awards sample	Pass (Arrow warnings only)
`Rscript pipeline/actual/R/fpds_lineage_pipeline.R --limit 10000` (repeated with `--limit 15000`)	Real awards samples	Pass (reticulate warnings noted)
Full-data runs (Python + R)	20,058 awards	Pass – outputs refreshed 2025‑11‑11
`Rscript pipeline/actual/R/regex_edge_case_tests.R`	Edge parquet via actual pipeline	Pass – asserts all regex families & metadata columns
Metrics regression	Both pipelines	Precision ≥0.8 and recall 1.0 logged on the edge harness; precision and recall ≥0.6 on real-data baselines

Noise to ignore: Arrow sysctlbyname warnings and reticulate’s attempt to probe /opt/miniconda3 inside the sandbox. They do not affect outputs but are documented for situational awareness.
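
The metrics-regression row in the table above reduces to a threshold check on precision and recall. A minimal sketch, assuming binary lineage labels; the function names and sample labels are hypothetical, and the thresholds are passed in as parameters, not read from the pipeline config.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = lineage present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def check_regression(y_true, y_pred, min_precision=0.6, min_recall=0.6):
    """Return the metrics and whether both clear their thresholds."""
    precision, recall = precision_recall(y_true, y_pred)
    return {
        "precision": precision,
        "recall": recall,
        "passed": precision >= min_precision and recall >= min_recall,
    }

labels = [1, 1, 1, 0, 0, 1]
predictions = [1, 1, 0, 1, 0, 1]
baseline_check = check_regression(labels, predictions)
strict_check = check_regression(labels, predictions, min_precision=0.8, min_recall=1.0)
```

With these sample labels both metrics are 0.75, so the run clears the real-data baseline but would fail the stricter edge-harness thresholds.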

Outputs Delivered (2025‑11‑11)

pipeline/actual/output/lineage_predictions.parquet (Python + R variants) — includes metadata, GAO tags, description text, evidence JSON.
pipeline/actual/output/lineage_review_summary.csv
pipeline/actual/output/regex_hits_summary.csv
pipeline/actual/output/transition_summary.csv
pipeline/actual/output/lineage_metrics.csv & pipeline/actual/output/lineage_confusion.csv
pipeline/actual/output/lineage_anomalies.csv
pipeline/actual/logs/vendor_anomalies.csv