JFIA Article Catalog

Builder · 2026 · 6 min read

Structured JSON index of all 469 articles published in the Journal of Forensic & Investigative Accounting (2009–2025) across 46 issues — titles, authors, abstracts, keywords, and direct PDF links. The only machine-readable catalog of JFIA; upstream data source for jfia-forensic detectlet schemas and krff-shell natural-language search.

Overview

A scraped, structured catalog of every article in the Journal of Forensic & Investigative Accounting (JFIA), 2009–2025. JFIA is the primary peer-reviewed venue for forensic accounting research — fraud detection, earnings manipulation, disclosure timing, insider networks — published by NACVA. The catalog provides 469 article records with titles, author lists, abstracts (363/469), keyword tags (242/469), and direct PDF links. It is the upstream data source for the jfia-forensic detectlet library, where each detectlet's `jfia_citations` field points to a specific catalog record, and for krff-shell Tool 11 (JFIA natural-language search).

Problem

JFIA does not publish a machine-readable article index. Researchers searching for articles on a specific forensic accounting topic must manually browse 46 issue pages or rely on Google Scholar's incomplete coverage of a niche journal. Building the jfia-forensic detectlet library required reliable article-level metadata: specifically, which JFIA articles provided the empirical basis for each detection signal and what threshold values they reported. Without a structured catalog, each detectlet's citation fields required per-article manual lookup — a recurring maintenance burden as new issues are published.

Constraints

  • NACVA does not provide an API — all metadata must be extracted from HTML issue pages via scraping
  • Abstract coverage is 77.4% (363/469) — early issues (2009–2012) had minimal abstract markup in NACVA's page structure; the remaining 22.6% are abstract-absent at the source
  • Keyword coverage is 51.6% (242/469) — keyword fields are sparser than abstracts, particularly for pre-2014 issues; keyword-absent articles are not retrievable by keyword search in downstream tools
  • PDF links follow an S3 URL pattern that has been stable since 2009 but is not guaranteed — re-scraping after a NACVA infrastructure change may require URL pattern updates
  • Catalog is a point-in-time snapshot; JFIA publishes approximately quarterly, and the scraper must be re-run manually to pick up new issues

Approach

Built `JFIA_metadata_scraper.py` using requests + BeautifulSoup to crawl all 46 NACVA issue pages. Each article record captures: index, title, authors (as list), abstract, keywords (as list), and PDF URL. Output is `jfia_catalog.json` (681 KB, UTF-8) structured as nested dicts: top-level metadata → issues list → articles list, with issue-level fields (volume, issue number, period, special-issue flag) preserved at the issue layer. Post-scrape normalization addressed author name format inconsistencies between issues (last–initial vs. first–last ordering) and repaired 14 malformed PDF URLs from issues where NACVA's markup deviated from the standard pattern.
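The author-name normalization step could look roughly like the sketch below. It is an illustration under stated assumptions, not the scraper's actual code: the helper name `normalize_author` is hypothetical, and it assumes that last-first names are written with a single comma.

```python
def normalize_author(name: str) -> str:
    """Return an author name in first-last order with single spaces.

    Assumption (not from the scraper): last-first names contain exactly
    one comma, e.g. "Fuerman, Ross D."; anything else is left in its
    original order. Suffixes such as "Smith, Jr." would need extra rules.
    """
    cleaned = " ".join(name.split())  # collapse runs of whitespace
    if cleaned.count(",") == 1:
        last, first = (part.strip() for part in cleaned.split(","))
        if first and last:
            return f"{first} {last}"
    return cleaned


authors = ["Fuerman, Ross D.", "Ross  D. Fuerman"]
normalized = [normalize_author(a) for a in authors]
print(normalized)  # ['Ross D. Fuerman', 'Ross D. Fuerman']
```

Normalizing to one ordering before writing the JSON means downstream consumers can compare author strings directly instead of handling both formats.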

Key Decisions

Nested JSON (issues → articles) rather than a flat CSV

Reasoning:

Issue-level metadata (volume, issue number, period string, special-issue flag) is needed by downstream tools that generate properly formatted APA-style citations. A flat CSV could store this by repeating issue fields on every article row, but that duplicates data in a way that complicates the scraper's update logic. The nested structure mirrors the actual information hierarchy: an issue contains articles, and citation formatting requires both levels.

Alternatives considered:
  • Flat CSV with repeated issue columns per article row: simpler to load with pandas, but issue-level fields are duplicated on every row, so a partial update can leave rows of the same issue with inconsistent issue metadata
  • SQLite with separate issues and articles tables — appropriate if query complexity warranted a relational model; over-engineered for a read-only catalog consumed by two downstream tools
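Choosing the nested layout does not rule out tabular analysis, because flat rows can be derived on demand. The sketch below assumes the field names shown in the catalog's JSON structure; the function name `iter_flat_rows` and the one-article sample are illustrative only.

```python
from typing import Any, Dict, Iterator

# Issue-level fields copied onto every flattened article row.
ISSUE_FIELDS = ("volume", "issue", "period", "is_special_issue")


def iter_flat_rows(catalog: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per article, with its issue's fields merged in."""
    for issue in catalog["issues"]:
        issue_part = {field: issue.get(field) for field in ISSUE_FIELDS}
        for article in issue["articles"]:
            yield {**issue_part, **article}


# Tiny sample in the catalog's shape (not real catalog contents beyond
# the example record).
sample = {
    "issues": [
        {
            "volume": 1,
            "issue": 1,
            "period": "2009 Q1",
            "is_special_issue": False,
            "articles": [
                {
                    "index": 1,
                    "title": "Bernard Madoff and the Solo Auditor Red Flag",
                    "authors": ["Ross D. Fuerman"],
                }
            ],
        }
    ]
}

rows = list(iter_flat_rows(sample))
print(rows[0]["volume"], rows[0]["title"])
```

The resulting rows can be handed to `csv.DictWriter` or a DataFrame constructor when a flat view is needed, while the stored file keeps a single copy of each issue's metadata.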

Include direct PDF links rather than issue page URLs

Reasoning:

Downstream users (detectlet researchers, krff-shell JFIA search) need the PDF, not the issue landing page. Providing the issue page URL shifts a navigation step onto every consumer. The S3 PDF URL pattern has been stable for 15+ years and is the link NACVA itself uses in citation metadata. If the S3 pattern changes, both the issue page URL and the PDF URL would need updating — there is no resilience advantage to storing the indirect link.

Alternatives considered:
  • Store issue page URL only — requires one additional scrape step per article to reach the PDF; breaks the self-contained utility of the catalog
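Because the PDF links depend on a URL pattern that is not guaranteed, a cheap post-scrape check can flag records whose link deviates from it. The sketch below is an assumption-heavy illustration: the regular expression is inferred from the single example URL in this write-up, the function name `nonconforming_pdf_urls` is hypothetical, and the second sample URL is invented to show a miss.

```python
import re

# Inferred from one example URL only; the real catalog may contain other
# valid shapes, so a miss means "inspect this record", not "broken link".
PDF_URL_RE = re.compile(
    r"^https://s3\.amazonaws\.com/web\.nacva\.com/JFIA/Issues/"
    r"JFIA-\d{4}-\d+_\d+\.pdf$"
)


def nonconforming_pdf_urls(catalog):
    """Return (article index, url) pairs that do not match the pattern."""
    flagged = []
    for issue in catalog["issues"]:
        for article in issue["articles"]:
            url = article.get("pdf_url") or ""
            if not PDF_URL_RE.match(url):
                flagged.append((article.get("index"), url))
    return flagged


base = "https://s3.amazonaws.com/web.nacva.com/JFIA/Issues/"
sample = {
    "issues": [
        {
            "articles": [
                {"index": 1, "pdf_url": base + "JFIA-2009-1_1.pdf"},
                # Invented malformed URL (missing extension) for illustration.
                {"index": 2, "pdf_url": base + "JFIA-2009-1_2"},
            ]
        }
    ]
}

flagged = nonconforming_pdf_urls(sample)
print(flagged)
```

Running a check like this after each re-scrape would surface a change in NACVA's URL scheme as a sudden jump in flagged records.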

Scraper as a standalone script, not an importable library

Reasoning:

The scraper must be re-run approximately quarterly as new issues are published. A standalone script whose only third-party dependencies are requests and BeautifulSoup can be re-run by anyone with a Python environment. Packaging it as a library introduces version-pinning risk: the parser logic may need to change when NACVA updates its page markup, and a consumer pinned to the library version used for the last scrape would keep running the outdated parser and silently produce stale output.

Alternatives considered:
  • Library with a JFIAScraper class — more testable; adds complexity for a function that runs once per quarter and whose primary consumer is a human checking the output

Tech Stack

  • Python
  • requests
  • BeautifulSoup4 (HTML parsing)
  • json (stdlib)
  • pytest

Result & Impact

  • 469
    Articles indexed
  • 46 (2009–2025)
    Issues covered
  • 363 (77%)
    Articles with abstracts
  • 242 (52%)
    Articles with keywords
  • 681 KB (jfia_catalog.json)
    Dataset size

The only machine-readable index of JFIA articles. Eliminates per-article manual lookup for detectlet citation research and enables the natural-language JFIA search in krff-shell (Tool 11). Each jfia-forensic detectlet's `jfia_citations` field is populated from this catalog — the catalog is the bibliographic infrastructure that makes the detectlet library auditable.

Learnings

  • NACVA's inconsistent HTML markup across 16 years of issues required per-issue structural inspection during scraper development. Abstracting the parser too early would have obscured the per-issue variations that needed special-case handling — specifically the abstract and keyword field patterns that changed between the 2009–2012 and post-2014 issue formats.
  • Keyword coverage is the quality ceiling for keyword-based retrieval: 48% of articles (227/469) have no keyword tags and cannot be retrieved by keyword query. Any downstream search interface must implement abstract-based text search as the primary fallback, because keyword filtering alone would miss nearly half the catalog.
  • The catalog's value compounds as the downstream jfia-forensic library matures. Each new detectlet adds a `jfia_citations` entry that cross-references a catalog record — transforming the catalog from a one-time data scrape into a living bibliographic reference that tracks which forensic signals have published empirical support.

Structure

`jfia_catalog.json` stores the full catalog as a nested object:

{
  "scraped_at": "...",
  "total_articles": 469,
  "issues": [
    {
      "volume": 1,
      "issue": 1,
      "period": "2009 Q1",
      "is_special_issue": false,
      "articles": [
        {
          "index": 1,
          "title": "Bernard Madoff and the Solo Auditor Red Flag",
          "authors": ["Ross D. Fuerman"],
          "abstract": "...",
          "keywords": ["Madoff", "solo auditor", "fraud"],
          "pdf_url": "https://s3.amazonaws.com/web.nacva.com/JFIA/Issues/JFIA-2009-1_1.pdf"
        }
      ]
    }
  ]
}
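A consumer can load the file with the standard library and run a basic consistency check. The sketch below assumes that `total_articles` is meant to equal the number of article records across all issues; the helper names `load_catalog` and `count_articles` are hypothetical, and the sample is written to a temporary file so the example runs without the real catalog.

```python
import json
import tempfile
from pathlib import Path


def load_catalog(path):
    """Read a catalog file (UTF-8 JSON) into a nested dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def count_articles(catalog):
    """Count article records across all issues."""
    return sum(len(issue["articles"]) for issue in catalog["issues"])


# Minimal stand-in for jfia_catalog.json with the same nesting.
sample = {
    "scraped_at": "...",
    "total_articles": 1,
    "issues": [
        {
            "volume": 1,
            "issue": 1,
            "period": "2009 Q1",
            "is_special_issue": False,
            "articles": [{"index": 1, "title": "Example title"}],
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "jfia_catalog.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    catalog = load_catalog(path)

print(count_articles(catalog) == catalog["total_articles"])  # True
```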

Coverage

| Metric | Count |
| --- | --- |
| Total articles | 469 |
| Issues | 46 |
| Date range | 2009–2025 |
| Articles with abstracts | 363 (77%) |
| Articles with keywords | 242 (52%) |

Abstract and keyword coverage drops sharply before 2013 — early NACVA issue pages used a minimal markup format that omitted both fields for many articles. Post-2014 coverage is near-complete.
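The coverage figures can be recomputed from the catalog itself. The sketch below is one way to do it, assuming that a missing abstract or keyword list is stored as an absent key, `null`, an empty string, or an empty list (how the real file encodes absence is not stated here); the function names are hypothetical.

```python
def coverage(catalog):
    """Return (total, with_abstract, with_keywords) article counts."""
    total = with_abstract = with_keywords = 0
    for issue in catalog["issues"]:
        for article in issue["articles"]:
            total += 1
            # Any falsy value (missing, None, "", []) counts as absent.
            with_abstract += bool(article.get("abstract"))
            with_keywords += bool(article.get("keywords"))
    return total, with_abstract, with_keywords


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Two-article sample: one fully populated record, one with neither field.
sample = {
    "issues": [
        {
            "articles": [
                {"abstract": "Some text", "keywords": ["fraud"]},
                {"abstract": None, "keywords": []},
            ]
        }
    ]
}

print(coverage(sample))              # (2, 1, 1)
print(pct(363, 469), pct(242, 469))  # 77.4 51.6
```

The last line reproduces the reported percentages from the raw counts: 363 of 469 is 77.4% and 242 of 469 is 51.6%.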

Role in the Toolkit

The catalog is consumed by two downstream components:

jfia-forensic: each detectlet YAML includes a `jfia_citations` list. Each entry points to a specific catalog article (matched by title) and records the threshold value that article reported. This makes the detectlet library auditable: every detection signal traces to a published empirical source.
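Title-based matching of a citation entry to a catalog record might be sketched as follows. The citation dict layout (`title`, `threshold`) and the helper names are assumptions for illustration; the actual detectlet YAML schema is not shown in this write-up, and the threshold value is a placeholder.

```python
def title_key(title):
    """Case-insensitive, whitespace-insensitive key for title matching."""
    return " ".join(title.casefold().split())


def build_title_index(catalog):
    """Map normalized title -> article record."""
    index = {}
    for issue in catalog["issues"]:
        for article in issue["articles"]:
            index[title_key(article["title"])] = article
    return index


def resolve_citation(citation, index):
    """Return the catalog record a citation points to, or None."""
    return index.get(title_key(citation["title"]))


sample = {
    "issues": [
        {
            "articles": [
                {
                    "index": 1,
                    "title": "Bernard Madoff and the Solo Auditor Red Flag",
                    "authors": ["Ross D. Fuerman"],
                }
            ]
        }
    ]
}

index = build_title_index(sample)
# Hypothetical citation entry; "threshold" is a placeholder value.
citation = {"title": "bernard madoff and the  solo auditor red flag", "threshold": None}
record = resolve_citation(citation, index)
print(record["authors"] if record else "unresolved")
```

A citation that resolves to `None` signals a title mismatch worth fixing before the detectlet is treated as traceable to its source.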

krff-shell Tool 11: the JFIA natural-language search tool loads `jfia_catalog.json` at startup and runs keyword and abstract full-text matching against the loaded records. A query like “Beneish M-Score Korean” returns ranked article records with titles, authors, abstracts, and direct PDF links, with no browser navigation required.
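Keyword plus full-text matching over the loaded records could be approximated as below. This is a simplified sketch, not Tool 11's implementation: the scoring weights, tokenizer, and function name `search` are assumptions, and the second sample article is invented to show the fallback for records without keywords.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(catalog, query, limit=10):
    """Rank articles by query-term overlap; keyword hits count double."""
    terms = set(tokens(query))
    scored = []
    for issue in catalog["issues"]:
        for article in issue["articles"]:
            kw = set(tokens(" ".join(article.get("keywords") or [])))
            text = set(tokens(
                article.get("title", "") + " " + (article.get("abstract") or "")
            ))
            score = 2 * len(terms & kw) + len(terms & text)
            if score:
                scored.append((score, article))
    # Sort on the score only; the sort is stable, so ties keep catalog order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article for _, article in scored[:limit]]


sample = {
    "issues": [
        {
            "articles": [
                {
                    "title": "Bernard Madoff and the Solo Auditor Red Flag",
                    "abstract": "...",
                    "keywords": ["Madoff", "solo auditor", "fraud"],
                },
                {
                    # Invented record with no keywords, found via its abstract.
                    "title": "Example Article Without Keywords",
                    "abstract": "Discusses auditor independence.",
                    "keywords": [],
                },
            ]
        }
    ]
}

results = search(sample, "madoff auditor")
print([article["title"] for article in results])
```

Scoring title and abstract text alongside keywords keeps keyword-less articles retrievable, which matters given that roughly half the catalog has no keyword tags.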