Korean Accounting Enforcement Dataset
240 Korean accounting violations coded across FSS and SFC enforcement records, DART-linked and validated through a five-phase bias audit — the first open, reproducible Korean enforcement dataset with machine-readable violation taxonomy and Beneish ratio coverage for named companies.
Overview
A structured dataset of Korean accounting fraud enforcement decisions, extracted from Financial Supervisory Service (FSS) and Securities & Futures Commission (SFC) publications. The dataset covers 240 violations across three source document types, with a bias-validated six-type taxonomy applied consistently across both regulators. 86 named companies are matched to DART corp_codes, enabling cross-reference to financial statement data. 60 company-years have Beneish ratios computed from DART filings. The full enrichment pipeline is reproducible — LLM-enriched from PDF/HWP source documents, validated through cohort splitting, blind prompt stripping, and cross-model replication.
Problem
Korean accounting fraud enforcement is unusually well-documented and unusually hard to use. The FSS publishes quarterly enforcement reports as PDFs in Korean. The SFC publishes its quarterly decisions in ZIP files containing PDFs. Neither is structured. Anyone wanting to do empirical research on Korean accounting fraud has to extract the violations themselves: once for the PDF parsing, once for Korean text normalization, once for the violation taxonomy, and once again for the cross-reference to DART if Beneish ratios are needed. The aggregate cost across research groups is enormous. The marginal benefit of doing it well once and publishing it is high.
A secondary problem: LLM enrichment of structured classification tasks is unreliable without validation. The natural failure mode is a prompt that scaffolds the answer: the model applies a label because the description matches the task surface, not because the source text provides specific evidence. Detecting this requires running the same cases with different prompts and different models, not just checking inter-model agreement on a shared prompt.
Constraints
- FSS Source 3 (229 cases): companies are anonymized as A사, B사 ("Company A", "Company B"), so no DART link is possible; taxonomy work is limited to the anonymized cohort
- 50 pre-2022 FSS PDFs are binary-corrupted at the source level — pdfplumber and pypdfium2 both fail; the 65-ok / 50-failed split is a permanent ceiling
- DART match ceiling ~90%: some named companies have restructured, delisted, or use name variants that cannot be resolved to active corp_codes
- Beneish coverage limited to 60 company-years: computation requires two consecutive years of comparable DART financials; many companies have gaps or use nature-of-expense reporting
- Dataset covers three of eight identified enforcement sources — data.go.kr structured CSV, CaseNote, auditor-side findings, FSC press releases, and three others are documented as v2.0 candidates
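The two-consecutive-year requirement follows from how the Beneish components are defined: most are ratios of a current-year quantity to its prior-year counterpart. The sketch below uses the standard published definitions of the six components this document refers to. It does not use the kr-beneish API, and the field names are illustrative rather than DART account identifiers; the input figures are made-up round numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearFinancials:
    """One fiscal year of inputs. Field names are hypothetical, not DART account IDs."""
    sales: float
    cogs: float
    receivables: float
    current_assets: float
    ppe: float
    total_assets: float
    current_liabilities: float
    long_term_debt: float
    income_continuing_ops: float
    cash_from_operations: float


def beneish_components(cur, prev):
    """Compute six Beneish components from two consecutive fiscal years."""
    def gross_margin(y):
        return (y.sales - y.cogs) / y.sales

    def soft_asset_share(y):
        # Share of assets that is neither current assets nor PP&E.
        return 1 - (y.current_assets + y.ppe) / y.total_assets

    def leverage(y):
        return (y.current_liabilities + y.long_term_debt) / y.total_assets

    return {
        "DSRI": (cur.receivables / cur.sales) / (prev.receivables / prev.sales),
        "GMI": gross_margin(prev) / gross_margin(cur),
        "AQI": soft_asset_share(cur) / soft_asset_share(prev),
        "SGI": cur.sales / prev.sales,
        "LVGI": leverage(cur) / leverage(prev),
        "TATA": (cur.income_continuing_ops - cur.cash_from_operations) / cur.total_assets,
    }


# Made-up figures for illustration only.
prev_year = YearFinancials(100, 60, 10, 40, 40, 100, 20, 20, 10, 8)
cur_year = YearFinancials(150, 105, 30, 50, 50, 125, 30, 32.5, 15, 5)
ratios = beneish_components(cur_year, prev_year)
```

GMI needs cost of goods sold for both years, which is one reason nature-of-expense reporting (no separate COGS line) removes a company-year from coverage.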
Approach
Three CSV outputs, each with a distinct analytical role. violations.csv (240 rows) captures the violation type from a closed-set taxonomy, the scheme type, forensic signals linked to the jfia-forensic detectlet vocabulary, and the source document. beneish_ratios.csv (60 rows) joins named-company violations to DART financial statements for the named cohort. dart_matches.csv (86 rows) maps named companies to DART corp_codes at roughly 90% match rate. The taxonomy went through five bias validation phases before the final production run. The enrichment uses a tiered model delegation: Haiku for high-volume structured classification with tool-use enum constraints, Sonnet for nuanced review and comparison, Opus reserved exclusively for the SFC final synthesis — one call interpreting ambiguous empirical findings against expected taxonomy patterns. Total cost through prompt repair: under $2.
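How the three files relate can be sketched as a two-step lookup: company name to corp_code, then corp_code to ratio rows. The column names below (`case_id`, `company_name`, `corp_code`, `fiscal_year`, `sgi`) are assumptions rather than the published schema, and the rows are synthetic placeholders, not records from the dataset. The standard library is used to keep the sketch dependency-free; a pandas merge would express the same join.

```python
import csv
import io

# Synthetic placeholder rows; column names are assumed, not the published schema.
VIOLATIONS_CSV = """case_id,company_name,violation_type
v001,Example Co,revenue_fabrication
v002,A사,asset_inflation
"""
DART_MATCHES_CSV = """company_name,corp_code
Example Co,00000001
"""
BENEISH_CSV = """corp_code,fiscal_year,sgi
00000001,2021,1.42
"""


def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def link_violations(violations, matches, ratios):
    """Attach a corp_code and any Beneish rows to each violation row."""
    code_by_name = {m["company_name"]: m["corp_code"] for m in matches}
    ratios_by_code = {}
    for row in ratios:
        ratios_by_code.setdefault(row["corp_code"], []).append(row)

    linked = []
    for v in violations:
        # None for anonymized or unmatched companies.
        code = code_by_name.get(v["company_name"])
        linked.append({**v, "corp_code": code, "ratios": ratios_by_code.get(code, [])})
    return linked


linked = link_violations(
    read_rows(VIOLATIONS_CSV), read_rows(DART_MATCHES_CSV), read_rows(BENEISH_CSV)
)
```

Anonymized cases stay in the output with `corp_code` set to `None` and an empty ratio list, so violation-type counts remain computable even where no financial data can be attached.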
Key Decisions
Six closed violation types rather than open tagging
A closed taxonomy enables quantitative analysis across the dataset — violation type frequency, Beneish component separability by type, precision computation. Open tagging produces richer per-case descriptions but cannot be aggregated. The six types (asset_inflation, revenue_fabrication, disclosure_fraud, liability_suppression, related_party, cost_distortion) were derived from the Korean regulatory enforcement literature and cover the material categories present in FSS and SFC decisions. Cases outside the six types are preserved with violation_type=null and full extracted text rather than forced into a category.
- Open tagging with post-hoc clustering — more flexible but prevents the precision computation that validates the taxonomy against Beneish components
- Hierarchical taxonomy (primary + secondary types) — better granularity but multiplies the inter-annotator agreement problem
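A closed set with an explicit null can be expressed as an enum constraint in a tool definition. The sketch below builds a definition in the general shape used for Anthropic tool use (`name`, `description`, `input_schema` as JSON Schema) without calling the SDK. The tool name `record_violation` and the `evidence_quote` field are hypothetical, and the validator re-checks the label on the client side rather than assuming the enum is enforced strictly.

```python
from typing import Optional

VIOLATION_TYPES = (
    "asset_inflation",
    "revenue_fabrication",
    "disclosure_fraud",
    "liability_suppression",
    "related_party",
    "cost_distortion",
)

# Hypothetical tool definition; only the six types come from the document.
CLASSIFY_TOOL = {
    "name": "record_violation",
    "description": "Record the violation type for one enforcement case.",
    "input_schema": {
        "type": "object",
        "properties": {
            "violation_type": {
                # null keeps out-of-taxonomy cases instead of forcing a category.
                "type": ["string", "null"],
                "enum": [*VIOLATION_TYPES, None],
            },
            "evidence_quote": {"type": "string"},
        },
        "required": ["violation_type", "evidence_quote"],
    },
}


def normalize_label(tool_input: dict) -> Optional[str]:
    """Return the label if it is in the closed set or null; raise otherwise."""
    label = tool_input.get("violation_type")
    if label is None or label in VIOLATION_TYPES:
        return label
    raise ValueError(f"out-of-vocabulary violation_type: {label!r}")
```

Rejecting unknown labels at this boundary is what keeps frequency counts and precision figures well-defined downstream.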
Five-phase bias validation protocol
Standard inter-model agreement testing (do Haiku and Sonnet agree?) cannot detect a universal prompt scaffold — if both models read the same generic description and apply the same label, agreement is high but the label may not reflect the case text. Phase A1 (cohort split: full-text vs metadata-only enrichment) detected differential scaffolding but missed universal artifacts. Phase A2 (blind prompt stripping: same cases, descriptions removed) was the critical design: TATA dropped from 100% to 20% when the description 'large unexplained total accruals' was removed. This revealed that the model was applying TATA to every fraud case because the description matched every fraud case — not because the text evidenced unexplained accruals. Phase A3 (cross-model with full prompt) confirmed TATA at 95% for Sonnet — ruling out a Haiku-specific artifact. Phase A4 repaired the three problematic descriptions (TATA, LVGI, GMI). Phase A5 re-ran production enrichment with the repaired prompt, recovering LVGI (89% precision post-repair) and reducing TATA to 34% concentrated in plausible violation types.
- Single inter-model agreement check (standard approach) — misses universal scaffolds that fire equally on both models; A3 would have passed TATA at 95% without A2's blind test revealing the artifact
- Human spot-check of 20 cases — slower, limited to expert availability, and still subject to the same prompt-reading bias the model has
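The blind-stripping comparison in Phase A2 can be sketched as follows: run the same cases through a prompt with signal descriptions and a prompt with names only, then flag signals whose assignment rate falls sharply. This is a minimal sketch, not the project's pipeline. The classifier is a toy stand-in for a model call, built to exhibit scaffolding; the case texts, the SGI description, and the `max_drop` threshold are all made up for illustration. Only the TATA wording is the pre-repair description quoted above.

```python
from __future__ import annotations

SIGNALS = {
    "TATA": "large unexplained total accruals",  # pre-repair wording from the text
    "SGI": "sales growth is unusually high",     # placeholder description
}

# Made-up case snippets, not dataset records.
CASES = [
    "Fictitious revenue booked through a shell distributor.",
    "Inventory overstated at year end.",
    "Accrual reversals in the following quarter were not explained.",
    "Premature revenue recognition on long-term contracts.",
]


def build_prompt(case_text: str, signals: dict[str, str], blind: bool) -> str:
    """List signal names, with descriptions unless running the blind variant."""
    lines = [name if blind else f"{name}: {desc}" for name, desc in signals.items()]
    return "Signals:\n" + "\n".join(lines) + "\n\nCase:\n" + case_text


def stub_classify(prompt: str) -> set[str]:
    """Toy stand-in for a model: fires TATA whenever its description is visible."""
    case = prompt.split("Case:\n", 1)[1].lower()
    fired = set()
    if "unexplained total accruals" in prompt or "accrual" in case:
        fired.add("TATA")
    if "revenue" in case:
        fired.add("SGI")
    return fired


def assignment_rates(cases, signals, classify, blind):
    """Fraction of cases on which each signal was assigned."""
    counts = {name: 0 for name in signals}
    for case in cases:
        for name in classify(build_prompt(case, signals, blind)):
            if name in counts:
                counts[name] += 1
    return {name: counts[name] / len(cases) for name in signals}


def scaffold_suspects(full, blind, max_drop=0.5):
    """Signals whose rate drops by more than max_drop once descriptions are removed."""
    return [name for name in full if full[name] - blind[name] > max_drop]


full_rates = assignment_rates(CASES, SIGNALS, stub_classify, blind=False)
blind_rates = assignment_rates(CASES, SIGNALS, stub_classify, blind=True)
suspects = scaffold_suspects(full_rates, blind_rates)
```

Running a second model through the full prompt only would reproduce the high TATA rate, which is why agreement alone cannot separate a scaffolded signal from an evidenced one.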
Haiku for bulk classification, Sonnet for review, Opus only for final analytical synthesis
Haiku handles the high-volume structured classification (metadata-only enrichment, blind test, A2 replication) efficiently at $0.03–0.10 for hundreds of cases with tool-use enum constraints. Sonnet handles review and nuanced comparison where calibration matters — A3 showed Sonnet produces 0 out-of-vocabulary signals when descriptions are removed (vs Haiku's 9), making it the correct choice for modified-prompt runs. Opus is reserved exclusively for the SFC final synthesis: interpreting ambiguous empirical patterns against expected taxonomy with noisy labeled data, where getting the interpretation wrong has downstream consequences. The prompt repair phase confirmed this routing was correct: Sonnet batch took 1.5 hours for 20 cases; sequential Sonnet took 3 minutes. Always use sequential for review-grade inference; batch for bulk classification.
- Sonnet throughout — 5–10× more expensive for bulk classification with no quality benefit on well-defined tool-use tasks
- Opus for all enrichment — prohibitively expensive at scale; Opus retained a clear edge only on ambiguous multi-step analytical judgment
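The routing rule above reduces to a small lookup table. In this sketch the model identifiers are the ones listed in the tech stack; the task names and the `Route` structure are hypothetical, and "sequential" simply means direct, one-at-a-time calls rather than batch submission.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    BULK_CLASSIFICATION = "bulk_classification"
    REVIEW = "review"
    FINAL_SYNTHESIS = "final_synthesis"


@dataclass(frozen=True)
class Route:
    model: str
    mode: str  # "batch" or "sequential"


ROUTES = {
    Task.BULK_CLASSIFICATION: Route("claude-haiku-4-5", "batch"),
    Task.REVIEW: Route("claude-sonnet-4-6", "sequential"),
    Task.FINAL_SYNTHESIS: Route("claude-opus-4-6", "sequential"),
}


def route_for(task: Task) -> Route:
    """Look up the model and call mode for a task kind."""
    return ROUTES[task]
```

Keeping the table in one place makes the batch-versus-sequential choice an explicit, reviewable decision rather than something buried in individual scripts.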
Tech Stack
- Python ≥3.11, uv
- pdfplumber, pypdfium2 (PDF extraction)
- Anthropic SDK (claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6)
- DART OpenAPI (financial statements for Beneish computation)
- kr-beneish (Beneish ratio computation for matched companies)
- pandas, pathlib
Result & Impact
- 240: total violations coded
- 86: named companies DART-matched
- 60: company-years with Beneish ratios
- 3: source document types (FSS named, FSS anonymized, SFC)
- 5: bias validation phases
- 95%: SGI → revenue_fabrication precision
- 90%: AQI → asset_inflation precision
The first open, reproducible Korean accounting enforcement dataset with a bias-validated taxonomy. The five-phase validation protocol demonstrates that LLM-enriched classification datasets can be made defensible — the TATA finding (100% assignment collapsing to 20% under blind stripping) is a concrete, falsifiable result that shows how generic prompt descriptions masquerade as empirical signal. The dataset's value is as a living artifact: when FSS publishes its next quarterly report, re-running the pipeline updates the dataset with one command.
Learnings
- LLM prompt scaffolding is undetectable through inter-model agreement alone. With the full prompt, TATA fired at 100% for Haiku and 95% for Sonnet: high agreement, but both models were reading the description, not the case text. The blind test (same cases, stripped descriptions) was the only design in the protocol that detected a universal scaffold. Standard inter-annotator agreement checks catch differential biases; they miss shared ones.
- Prompt specificity is the primary quality lever for structured extraction. Rewriting TATA's description from 'large unexplained total accruals' to 'total accruals are materially large relative to assets with no clear business explanation; assign only when the case text specifically references accrual magnitude or reversal patterns' reduced the assignment rate from 100% to 34% concentrated in plausible violation types. The fix was one sentence. The validation to find it required five phases.
- DART match rate has a hard ceiling determined by company lifecycle, not data quality. Restructured, merged, or delisted companies produce name variants that cannot be resolved to active corp_codes. For Korean enforcement research, the effective labeled sample size for ML work is the 60 Beneish rows, not the 240 violations — the constraint is financial history availability, not extraction quality.
- Model delegation routing matters operationally. Sonnet batch ran 1.5 hours for 20 cases; sequential ran 3 minutes. The batch API's completion time scales poorly for review-grade inference tasks where the model processes complex Korean legal text. Route bulk classification to Haiku with tool-use enums; route analytical review to Sonnet sequential; reserve Opus for one-shot synthesis where the interpretation determines credibility.
Source Documents
Three regulator publications with different document formats and identification levels:
FSS 심사·감리지적사례 (Source 3, 229 cases, anonymized). FSS quarterly audit-quality findings. Companies are anonymized as “A사”, “B사” — no DART link is possible. 200 of 229 enriched via LLM; 65 had full PDF text extracted; the remainder enriched from metadata only. The largest source by case count and the richest for taxonomy work.
FSS 회계감리결과제재 (Source 2, 71 cases, named). Named-company sanction decisions published as HWP files. 64 of 71 matched to DART corp_codes; 49 have Beneish ratios computed. The source that enables DART cross-reference.
SFC 증선위의결정보 (Source 1, 28 cases, mixed identification). Securities & Futures Commission quarterly decisions in ZIP files containing PDFs. 15 of 28 are redacted; 13 are named. 6 are DART-matched, yielding 11 company-years with Beneish ratios.
Defensible Claims
The bias validation produced four claims with documented precision:
| Component | Maps to | Precision (post-repair, n=65) |
|---|---|---|
| SGI | revenue_fabrication | 95% (21/22) |
| AQI | asset_inflation | 90% (19/21) |
| DSRI | revenue_fabrication (supporting) | 86% (19/22) |
| LVGI | liability_suppression | 89% (8/9) |
TATA is not in this table. It fired at 100% before prompt repair and 34% after — diffuse across violation types. The repair was one sentence in the prompt; finding the problem required five validation phases. GMI shows a model-dependent rate (Haiku 10%, Sonnet 29%) and is cited with a caveat.
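The percentages in the table can be re-derived from the fractions shown, reading each fraction as matching assignments over total assignments of that component (an assumption about what the denominators count). A short check:

```python
# (hits, total) pairs copied from the table above.
COUNTS = {
    "SGI": (21, 22),
    "AQI": (19, 21),
    "DSRI": (19, 22),
    "LVGI": (8, 9),
}


def precision_pct(hits: int, total: int) -> int:
    """Precision as a whole-number percentage."""
    return round(100 * hits / total)


precision = {name: precision_pct(h, t) for name, (h, t) in COUNTS.items()}
```

The denominators are small (9 to 22 cases), so each figure moves by several points if a single case is relabeled.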
What the Dataset Cannot Do
The anonymized FSS cohort (229 cases) cannot be DART-linked. A proper supervised model requires a non-fraud control group with comparable Beneish ratios — building that requires joining kr-company-registry and kr-beneish against a representative non-fraud universe. The methodology is documented in reports/ml-feasibility-and-next-steps.md. The dataset supports descriptive empirical work today; supervised classification requires the matched-control construction first.