Korean Presale Transfer Window Scanner
A data pipeline that crosses 청약홈 presale lottery records with K-APT construction permit data to surface apartments where 분양권 전매제한 (presale transfer restrictions) have expired — a transaction window that 청약홈 never surfaces directly. 6 APIs confirmed live. 10/10 join test passed.
Overview
Under 주택법 (Housing Act) §64 → 시행령 (Enforcement Decree) §73 → 별표3 (Appended Table 3), presale transfer restrictions in 수도권 규제지역 (regulated areas of the Seoul metropolitan area) expire automatically 3 years after the 당첨자발표일 (winner announcement date). 청약홈 (the official presale registry) records the 당첨자발표일 for every complex but never publishes expiry dates or flags when restrictions have lapsed. Brokers specializing in 분양권 전매 identify these windows by manually cross-referencing 청약홈 records against construction progress, a labor-intensive process across thousands of complexes. This project builds the pipeline that automates the cross-reference: a 4-stage API join chain that computes restriction expiry dates, verifies construction status via K-APT, and flags complexes where the 분양권 still exists but the restriction has expired.
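The expiry computation itself is small. The sketch below shows one way to add the 3-year clock to a 당첨자발표일. The function names are hypothetical, and two details are assumptions rather than statements of the statute: an announcement date of Feb 29 falls back to Feb 28, and the restriction is treated as lapsed on the anniversary day itself.

```python
from datetime import date

RESTRICTION_YEARS = 3  # the 3-year period described in the overview


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later.

    Assumption: Feb 29 maps to Feb 28 when the target year is not a leap year.
    """
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only Feb 29 in a non-leap target year reaches this branch.
        return d.replace(year=d.year + years, day=28)


def restriction_expiry(announcement_date: date) -> date:
    """Expiry date = 당첨자발표일 + 3 years."""
    return add_years(announcement_date, RESTRICTION_YEARS)


def is_expired(announcement_date: date, today: date) -> bool:
    """Assumption: the restriction counts as lapsed on the anniversary day."""
    return today >= restriction_expiry(announcement_date)
```

Whether the first transferable day is the anniversary or the day after is a legal detail worth confirming before relying on the boundary case.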
Problem
분양권 전매 (presale rights transfer) brokers operate in a legally precise window. The restriction lapses automatically under the statute, with no LH approval or official notification required, but the expiry date is never published in a structured format anywhere. A broker working the 수도권 규제지역 market must manually track hundreds of complexes, cross-check 청약홈 records for the 당첨자발표일, calculate the 3-year clock, and verify that 소유권이전등기 (ownership transfer registration) hasn't yet occurred, since registration extinguishes the 분양권. No tool does this. The broker who has this data has a structural information advantage over their clients and competitors.
Constraints
- 청약홈 API publishes 당첨자발표일 but not expiry dates or restriction status — the expiry date must be computed (당첨자발표일 + 3 years) and the restriction status must be inferred from construction data
- K-APT complex registry (공동주택관리정보시스템) only covers completed or near-completion complexes. 착공-delayed units (construction start delayed; the 3기 신도시 사전청약 apartments) are absent from K-APT by design, and their absence is the signal that 소유권이전등기 is impossible and the 분양권 still exists
- Name matching between 청약홈 complex names and K-APT records is ambiguous in dense areas — multiple complexes share address tokens, requiring AI disambiguation for multi-candidate cases
- 주택인허가 API startDate/endDate parameters are silently ignored by the government endpoint — date filtering must be applied post-fetch on the returned records
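Because the date parameters are ignored server-side, the filter has to run on the client after the fetch. A minimal sketch follows, assuming records arrive as dicts with a `YYYYMMDD` string date. The field name `permitDate` is a placeholder, not the endpoint's actual field name.

```python
from datetime import date, datetime
from typing import Iterable, Iterator, Mapping, Optional

DATE_FIELD = "permitDate"  # hypothetical field name; check the real response schema


def parse_yyyymmdd(value) -> Optional[date]:
    """Parse a YYYYMMDD string; return None for blank, missing, or malformed values."""
    try:
        return datetime.strptime(value.strip(), "%Y%m%d").date()
    except (ValueError, AttributeError):
        return None


def filter_by_date(
    records: Iterable[Mapping],
    start: date,
    end: date,
    field: str = DATE_FIELD,
) -> Iterator[Mapping]:
    """Yield only records whose date falls inside [start, end], applied post-fetch."""
    for rec in records:
        d = parse_yyyymmdd(rec.get(field, ""))
        if d is not None and start <= d <= end:
            yield rec
```

Records with unparseable dates are dropped here. A real pipeline might prefer to log them instead, since a missing date can itself be informative.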
Approach
A 4-stage join chain. Stage 1: fetch 청약홈 APT 분양정보 for 수도권 complexes with 당첨자발표일 in target window. Stage 2: match to K-APT 단지 목록 via address-weighted fuzzy name matching (SequenceMatcher with FUZZY_THRESHOLD=0.65). Stage 3: fetch 공동주택 기본정보 for 사용승인일 (proxy for construction completion). Stage 4: fetch 주택인허가 기본개요 for actual 착공일 where the K-APT record is ambiguous. Complexes absent from K-APT after name matching are flagged as 착공-delayed — their 분양권 is intact. The pipeline runs 10 concurrent workers with an AI match cache that eliminates redundant calls across re-runs. Claude Haiku (not Sonnet) handles ambiguous multi-candidate K-APT joins after rule-based address pre-filtering resolves the unambiguous cases.
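Stage 2 can be sketched as follows. `SequenceMatcher` and `FUZZY_THRESHOLD = 0.65` come from the description above. The 0.7/0.3 weighting, the token-overlap address score, and the `name`/`address` dict keys are assumptions made for illustration, not the project's actual scoring formula.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.65
NAME_WEIGHT = 0.7     # hypothetical weight
ADDRESS_WEIGHT = 0.3  # hypothetical weight


def normalize(name: str) -> str:
    """Strip whitespace and lowercase so spacing variants compare equal."""
    return "".join(name.split()).lower()


def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def address_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated address tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_score(presale: dict, kapt: dict) -> float:
    return (NAME_WEIGHT * name_similarity(presale["name"], kapt["name"])
            + ADDRESS_WEIGHT * address_overlap(presale["address"], kapt["address"]))


def candidates(presale: dict, kapt_rows: list, threshold: float = FUZZY_THRESHOLD) -> list:
    """Return (score, row) pairs at or above the threshold, best first."""
    scored = [(match_score(presale, row), row) for row in kapt_rows]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

An empty result from `candidates` is what the pipeline would read as a K-APT absence. A single result is an unambiguous match, and several results go on to disambiguation.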
Key Decisions
Absence from K-APT as the 착공 delay signal
3기 신도시 사전청약 complexes are precisely the units where 분양권 are most likely to still exist — construction is delayed, 소유권이전등기 hasn't occurred, and the 3-year clock has often already passed. These units are absent from K-APT because K-APT only covers complexes registered in 공동주택관리정보시스템, which requires construction completion. Absence is not a data quality problem — it is the signal.
- Rejected alternative: treat K-APT misses as data gaps and exclude them. This drops the most commercially valuable cases.
- Rejected alternative: use the 주택인허가 API for all records. The API silently ignores date filters and requires expensive post-fetch filtering; the K-APT path is faster for completed complexes.
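The decision above amounts to a small classification step in which a missing K-APT match is an outcome, not an error. The sketch below illustrates that logic. The dataclass, its field names, and the status labels are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class JoinedComplex:
    name: str
    restriction_expiry: date
    kapt_code: Optional[str]            # None when no K-APT match was found
    use_approval_date: Optional[date]   # 사용승인일 from K-APT, if present


def classify(c: JoinedComplex, today: date) -> str:
    """Map a joined record to a status label (labels are illustrative)."""
    if c.kapt_code is None:
        # Absent from K-APT: read as 착공-delayed, so the 분양권 is still intact.
        if today >= c.restriction_expiry:
            return "transfer_window_open"
        return "restricted"
    if c.use_approval_date is not None:
        # Completed complex identified via the K-APT path.
        return "completed"
    # K-APT record exists but is ambiguous: fall back to the 주택인허가 lookup.
    return "needs_permit_lookup"
```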
Claude Haiku for K-APT disambiguation, not Sonnet
The disambiguation task is a selection: given two or three K-APT complexes in the same 동 (neighborhood) with similar names, which one matches the 청약홈 record? This requires reading an address and a name on each side and picking one, which makes it a classification task. Haiku handles this correctly after the rule-based address pre-filter narrows candidates to 2-3. Sonnet adds cost without improving accuracy on this kind of selection.
- Rejected alternative: length heuristic (shorter complex name = correct match). Fragile when two complexes in the same district have names of similar length.
- Rejected alternative: fuzzy threshold only, no AI. A 0.65 threshold alone misses genuine matches that score below it in areas with ambiguous addresses.
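The routing can be sketched without committing to any particular model client: rules handle the zero- and one-candidate cases, and only the remainder reaches the model, behind a cache. In the sketch below, `ai_choose` stands in for the Haiku call, the dict keys (`id`, `dong`, `kapt_code`) are hypothetical, and the in-memory dict stands in for the persisted cache.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Stand-in for the model call: returns the chosen kapt_code, or None for "no match".
Chooser = Callable[[dict, Sequence[dict]], Optional[str]]
CacheKey = Tuple[str, Tuple[str, ...]]


def resolve(
    presale: dict,
    candidates: Sequence[dict],
    ai_choose: Chooser,
    cache: Dict[CacheKey, Optional[str]],
) -> Optional[dict]:
    """Pick the K-APT record for one presale record, calling the model only if needed."""
    # Rule-based pre-filter: keep candidates in the same 동.
    narrowed = [c for c in candidates if c["dong"] == presale["dong"]]
    if not narrowed:
        return None              # treated as absent from K-APT
    if len(narrowed) == 1:
        return narrowed[0]       # unambiguous, no model call

    # Cache on the presale id plus the sorted candidate codes, so re-runs
    # with the same inputs never repeat the call.
    key = (presale["id"], tuple(sorted(c["kapt_code"] for c in narrowed)))
    if key not in cache:
        cache[key] = ai_choose(presale, narrowed)
    chosen = cache[key]
    return next((c for c in narrowed if c["kapt_code"] == chosen), None)
```

Caching the chosen code rather than a list position keeps the cache valid even if candidate order changes between runs.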
FUZZY_THRESHOLD raised from 0.30 to 0.65
At 0.30, brand token coincidences (e.g. two 'e편한세상' complexes in different districts) passed the threshold and sent near-random name pairs to the AI. At 0.65, only genuine near-matches reach AI disambiguation. Matches below 0.65 are treated as complex-absent (착공 delay signal), which is the correct behavior.
- Rejected alternative: keep the threshold at 0.30. This produces too many spurious AI calls on genuinely absent complexes.
Tech Stack
- Python 3.11
- 청약홈 APT 분양정보 OpenAPI (15101046)
- K-APT 공동주택 단지목록 / 기본정보 API
- 주택인허가 기본개요 API (15136560)
- 분양권전매 실거래가 API (15126471)
- Claude Haiku (AI disambiguation — ambiguous K-APT joins only)
- pandas, Parquet
- python-dotenv
Result & Impact
- APIs confirmed live: 6
- Join test result: 10/10 passed
- 주택인허가 records fetched: 293,000+
- AI disambiguation cache: built (Parquet)
The join chain correctly identifies completed complexes via the K-APT path and correctly flags 착공-delayed complexes as K-APT-absent. The 10/10 join test used empirically verified completed 서울 complexes (당첨자발표일 2020–2022) — units where kaptUsedate is confirmed present. The absence behavior for 3기 신도시 units is a design property confirmed as correct.
Learnings
- Government API documentation describes what the API is supposed to return; live probe results describe what it actually returns. The 주택인허가 API silently ignores date parameters documented as valid — discovered only by running the API and observing that different date ranges returned identical result sets. Always probe before designing around an API parameter.
- Absence from a registry is a signal, not a data gap, when the registry's coverage rules are understood. K-APT's coverage rules (only completed complexes) are well-documented, making absence a precise, interpretable signal rather than ambiguity.
- Selection-style AI classification tasks (which of these K-APT records matches?) can run on a smaller, cheaper model than synthesis tasks need. Routing selection to Haiku and reserving Sonnet for generation-level tasks was the right cost-accuracy tradeoff here.
- FUZZY_THRESHOLD is a hyperparameter that determines how many cases the AI sees and how many are treated as definitively absent. Setting it too low floods the AI with false positives; setting it too high misses genuine near-matches. The right value requires empirical calibration against test units, not a guess.
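The calibration described in the last point can be framed as a sweep over candidate thresholds against hand-labeled pairs. The sketch below is one hypothetical way to do it. The function name is an assumption, and the scores in the usage example are made-up toy values, not the project's measurements.

```python
from typing import Iterable, List, Sequence, Tuple

# (fuzzy score, is_true_match) for pairs labeled by hand.
LabeledPair = Tuple[float, bool]


def sweep(labeled: Sequence[LabeledPair],
          thresholds: Iterable[float]) -> List[Tuple[float, int, int]]:
    """For each threshold, count spurious AI calls and missed matches.

    spurious: non-matches scoring at or above the threshold (sent to the AI for nothing)
    missed:   true matches scoring below the threshold (wrongly treated as absent)
    """
    rows = []
    for t in thresholds:
        spurious = sum(1 for score, is_match in labeled if score >= t and not is_match)
        missed = sum(1 for score, is_match in labeled if score < t and is_match)
        rows.append((t, spurious, missed))
    return rows


# Toy labeled pairs, for illustration only.
toy = [(0.92, True), (0.71, True), (0.58, True),
       (0.41, False), (0.33, False), (0.12, False)]
for t, spurious, missed in sweep(toy, [0.30, 0.50, 0.65]):
    print(f"threshold={t:.2f} spurious_ai_calls={spurious} missed_matches={missed}")
```

Reading the two counts side by side makes the tradeoff explicit: lowering the threshold trades missed matches for spurious AI calls, and the labeled set decides where the balance sits.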