
File type resolution and ambiguity policy

This page documents how TopMark resolves a concrete filesystem path to the most specific matching FileType, and then to the bound HeaderProcessor registered for that file type.

Resolver behavior is deterministic and operates on canonical qualified file type identities such as topmark:python.

Note

The canonical vocabulary used throughout the documentation is defined in Terminology and Canonical Vocabulary.

This page complements the registry architecture described in registry-model.md:

  • registries define what exists
  • the resolver defines what wins for a concrete path

This resolver operates within the broader TOML → FrozenConfig → runtime architecture (see architecture.md). It consumes the effective composed runtime registry state. It does not itself perform configuration discovery, export layered configuration provenance, or resolve the strictness of staged config-loading validation.

In particular, source-local TOML options such as [config].root and [config].strict are resolved before runtime file-type resolution and probing begin. They influence discovery and staged config-loading validation behavior, but are not part of the resolver's matching or tie-break logic.

Note

[config].strict is a TOML-source-local strictness preference controlling staged configuration-loading validation for the current TOML source.

Effective strictness is evaluated across:

  • TOML-source diagnostics;
  • merged-config diagnostics;
  • runtime applicability diagnostics.

strict is resolved during TOML loading and does not become a layered configuration field.

This distinction is also visible in topmark config dump --show-layers: layered provenance exports are produced earlier from resolved TOML sources and the flattened compatibility view, while file-type resolution happens later against the already-validated effective runtime configuration.


Overview

TopMark has two different resolution modes:

  • Identifier-based lookup resolves file types or processors from explicit local or qualified identifiers through the registries.
  • Path-based resolution resolves a real path by evaluating extension, filename, pattern, and optional content-based signals.

Registry-facing APIs normalize identifiers to canonical qualified keys before resolver and binding operations.

Before path-based resolution runs, TopMark performs file discovery and filtering. Paths excluded at that stage do not participate in candidate generation or scoring.

File-type filters accept both local identifiers such as python and canonical qualified identifiers such as topmark:python.

Resolver filtering operates on canonical qualified file type identities.

Path-based resolution is implemented in topmark.resolution.filetypes and consumed by ResolverStep.

The main public entry points are:

These entry points participate only in path-based runtime resolution and probing. They do not surface or consume layered config provenance payloads such as the human-facing [[layers]] export or the machine-readable config_provenance payload used by topmark config dump --show-layers.

They operate after staged config-loading validation has completed and the effective runtime configuration is finalized.

Note

Internal helper types such as PolicyOverrides and ConfigOverrides are not part of the stable public API surface. They are internal runtime orchestration helpers used by the CLI and public API wrappers.

Public callers should pass plain mapping-based inputs through config=..., policy=..., and policy_by_type=... instead of constructing these objects directly.

At this layer, path-based resolution consumes the already-finalized runtime configuration and effective composed runtime registry state.

Public API callers provide mapping-based inputs; internal typed runtime override objects are introduced earlier by CLI/API orchestration and are not part of the resolver contract.

See also:


Resolution pipeline boundaries

TopMark intentionally separates:

  1. discovery filtering
  2. runtime configuration resolution
  3. registry composition
  4. runtime file-type probing
  5. deterministic winner selection
  6. pipeline execution

Each stage operates on the finalized outputs of the previous stage.

This layered architecture keeps runtime resolution deterministic while preserving observability, stable machine-readable diagnostics, and explicit configuration/runtime boundaries.


Probe-based resolution (1.0 contract)

TopMark 1.0 exposes a probe-first resolution model via probe_resolution_for_path().

Probe-based resolution operates only after discovery filtering and configuration normalization have completed.

This function returns a ResolutionProbeResult containing:

  • selected file type and processor (if any)
  • probe status and reason
  • all scored candidate file types
  • match signals used during resolution
  • filtered explicit inputs that did not reach file-type probing

The probe result is the canonical source of truth for runtime resolution decisions once a path reaches file-type probing. Explicit inputs may be filtered earlier during discovery; those cases are represented as synthetic probe results with status="filtered" and one of:

  • reason="excluded_by_path_filter"
  • reason="excluded_by_file_type_filter" (this includes canonicalized file-type filtering using normalized qualified identifiers)
  • reason="excluded_by_discovery_filter" (fallback when the exact category is not identified)

The probe result has two consumers:

  • ResolverStep consumes ctx.resolution_probe and maps it to pipeline state
  • ProberStep exposes the same data for topmark probe

This unifies:

  • human output (TEXT / Markdown)
  • machine-readable output (JSON / NDJSON)
  • pipeline runtime-resolution behavior

Callers should use probe_resolution_for_path() when they need path-based resolution details.

For stable integrations, prefer topmark.api.probe(), which returns normalized public DTOs.

probe_resolution_for_path() is an advanced helper that exposes internal runtime probe structures and is not part of the topmark.api compatibility contract.

Note that probe_resolution_for_path() only applies to paths that passed discovery filtering. The topmark probe command augments these results with discovery-level explanations for explicitly requested paths that were filtered before probing.
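
As an illustration of the synthetic "filtered" results described in this section, the following minimal sketch builds one from two boolean flags. FilteredProbe and make_filtered_probe are hypothetical names invented for this example and are not TopMark types or functions. Checking the path filter before the file-type filter is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilteredProbe:
    """Hypothetical stand-in for a synthetic probe result (not a TopMark type)."""

    path: str
    status: str
    reason: str


def make_filtered_probe(
    path: str, *, by_path_filter: bool, by_file_type_filter: bool
) -> FilteredProbe:
    """Build a synthetic result for an explicit input that never reached probing."""
    if by_path_filter:
        reason = "excluded_by_path_filter"
    elif by_file_type_filter:
        reason = "excluded_by_file_type_filter"
    else:
        # Fallback when the exact exclusion category is not identified.
        reason = "excluded_by_discovery_filter"
    return FilteredProbe(path=path, status="filtered", reason=reason)
```

Only the status and reason strings are taken from this page; the remaining structure is illustrative.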


Candidate generation

Candidate generation is performed by get_file_type_candidates_for_path().

For each effective FileType, the resolver evaluates name-based signals and, when allowed, optional content-based signals.

Candidate generation operates against the effective composed runtime registry.

Name-based signals

The resolver computes three name-based match signals:

  • extension: the basename ends with one of the file type's configured extensions
  • filename: the basename or normalized path tail matches one of the file type's configured filenames
  • pattern: the basename fully matches one of the file type's configured regular-expression patterns

These signals are represented by MatchSignals.
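
The three name-based signals can be sketched roughly as follows. This is an illustrative approximation, not TopMark's implementation: NameSignals and compute_name_signals are hypothetical names, paths are assumed to use forward slashes, and a "path tail" is assumed to mean a trailing run of whole path components.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Sequence


@dataclass(frozen=True)
class NameSignals:
    """Hypothetical container for the three name-based signals."""

    extension: bool
    filename: bool
    pattern: bool


def compute_name_signals(
    path: str,
    extensions: Sequence[str],
    filenames: Sequence[str],
    patterns: Sequence[str],
) -> NameSignals:
    posix = PurePosixPath(path)
    basename = posix.name
    full = posix.as_posix()

    # extension: the basename ends with a configured extension.
    ext_hit = any(basename.endswith(ext) for ext in extensions)

    # filename: the basename matches exactly, or the path ends with the
    # configured tail on a component boundary (assumption).
    file_hit = any(
        basename == name or full == name or full.endswith("/" + name)
        for name in filenames
    )

    # pattern: the basename must match a configured regex in full.
    pat_hit = any(re.fullmatch(p, basename) is not None for p in patterns)

    return NameSignals(extension=ext_hit, filename=file_hit, pattern=pat_hit)
```

For example, compute_name_signals("proj/.vscode/settings.json", [".json"], [".vscode/settings.json"], []) reports both an extension hit and a filename-tail hit.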

Content gating

Content probing is controlled by the file type's ContentGate. This prevents unrelated files from being probed unnecessarily and allows specialized overlay-style file types to refine generic matches.

Examples:

  • ContentGate.NEVER disables content probing entirely
  • ContentGate.IF_EXTENSION only allows probing when an extension matched
  • ContentGate.IF_FILENAME only allows probing when a filename or tail matched
  • ContentGate.IF_PATTERN only allows probing when a pattern matched
  • ContentGate.IF_ANY_NAME_RULE allows probing when any name-based rule matched
  • ContentGate.IF_NONE allows probing only when the file type declares no name-based rules
  • ContentGate.ALWAYS allows content probing unconditionally
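
The gate semantics listed above can be expressed as a small decision function. This is a sketch under assumptions: Gate is a local stand-in that mirrors the ContentGate member names from this page, and probe_allowed with its parameters is a hypothetical helper, not a TopMark API.

```python
from enum import Enum


class Gate(Enum):
    """Local stand-in mirroring the ContentGate members named above."""

    NEVER = "never"
    IF_EXTENSION = "if_extension"
    IF_FILENAME = "if_filename"
    IF_PATTERN = "if_pattern"
    IF_ANY_NAME_RULE = "if_any_name_rule"
    IF_NONE = "if_none"
    ALWAYS = "always"


def probe_allowed(
    gate: Gate,
    *,
    extension: bool = False,
    filename: bool = False,
    pattern: bool = False,
    has_name_rules: bool = True,
) -> bool:
    """Return True when content probing may run for this file type."""
    if gate is Gate.NEVER:
        return False
    if gate is Gate.ALWAYS:
        return True
    if gate is Gate.IF_EXTENSION:
        return extension
    if gate is Gate.IF_FILENAME:
        return filename
    if gate is Gate.IF_PATTERN:
        return pattern
    if gate is Gate.IF_ANY_NAME_RULE:
        return extension or filename or pattern
    # IF_NONE: only file types that declare no name-based rules may probe.
    return not has_name_rules
```

The function only decides whether a probe may run; whether the probe itself succeeds is a separate question that feeds into candidate inclusion.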

Candidate inclusion

A file type becomes a candidate when its evaluated signals satisfy the runtime resolver's inclusion rules.

This means that:

  • a candidate may be included purely from name-based signals
  • a candidate may be included only after a successful content probe
  • a candidate may be excluded even when some name-based signals matched if the configured content gate requires a positive content hit

Candidate generation may therefore yield multiple file types for the same path. This is intentional and is handled by the deterministic selection policy described below.


Scoring model

Each included candidate is assigned a precedence score by _score_file_type_candidate().

Higher scores are better.

The current precedence model is:

  1. explicit filename or filename-tail match
  2. content-confirmed match
  3. pattern match
  4. extension match

A small bonus is applied to file types that are not marked skip_processing=True, which gives header-capable types a stable advantage on otherwise equal matches.

More specifically:

  • filename and path-tail matches receive the highest scores and become more specific as the matched tail becomes longer
  • content-confirmed matches outrank generic pattern and extension matches
  • pattern matches outrank plain extension matches
  • extension matches remain valid fallbacks for generic formats

The scoring model is intentionally biased toward the most specific match, while still keeping generic file types useful as fallbacks.
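
To make the ordering concrete, here is a minimal scoring sketch. The numeric weights, the function name score_candidate, and the rule that only the strongest signal counts are all assumptions made for illustration. The actual values used by _score_file_type_candidate() are not stated on this page.

```python
def score_candidate(
    *,
    filename_tail_parts: int = 0,
    content_confirmed: bool = False,
    pattern: bool = False,
    extension: bool = False,
    skip_processing: bool = False,
) -> int:
    """Score a candidate using hypothetical weights that respect the documented order."""
    if filename_tail_parts > 0:
        # Longer matched tails are more specific (assumed: 10 points per component).
        score = 400 + 10 * filename_tail_parts
    elif content_confirmed:
        score = 300
    elif pattern:
        score = 200
    elif extension:
        score = 100
    else:
        return 0  # nothing matched; no bonus applies

    # Small bonus for header-capable types (not skip_processing).
    if not skip_processing:
        score += 1
    return score
```

With these illustrative weights, a two-component filename tail scores 421, a content-confirmed match 301, and a plain extension match on a skip_processing type 100.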


Deterministic selection

Final selection is handled by _select_best_file_type_candidate().

TopMark does not treat multiple candidates as an error. Instead, it applies a deterministic ordering key defined by candidate_order_key().

Candidates are ordered by:

  1. score (descending)
  2. namespace (ascending)
  3. local key (ascending)

The winning file type is therefore stable for a given:

  • effective composed runtime registry state
  • path and filename
  • file content
  • configuration and filtering state

In practice, this means:

  • the highest-scoring candidate wins
  • if multiple candidates have the same score, namespace is used as the first stable tie-breaker
  • if score and namespace are equal, local key is used as the final stable tie-breaker

This policy guarantees that the same path, content, and effective registry state always produce the same winning FileType.

TopMark uses a deterministic winner-selection policy rather than an ambiguity-error policy.
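
The ordering described in this section amounts to a sort on a three-part key. The sketch below shows the idea with a hypothetical Candidate type; order_key and select_best are illustrative names and are not the actual candidate_order_key() or _select_best_file_type_candidate() implementations.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical scored candidate."""

    namespace: str
    local_key: str
    score: int


def order_key(c: Candidate) -> Tuple[int, str, str]:
    # Negate the score so that higher scores sort first; namespace and
    # local key then break ties in ascending order.
    return (-c.score, c.namespace, c.local_key)


def select_best(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the single deterministic winner, or None when there are no candidates."""
    if not candidates:
        return None
    return min(candidates, key=order_key)


candidates = [
    Candidate("topmark", "json", 100),
    Candidate("acme", "jsonc", 100),
    Candidate("acme", "json", 100),
    Candidate("topmark", "text", 50),
]
winner = select_best(candidates)
```

Here the three candidates tied at score 100 are ordered by namespace and then local key, so the winner is acme:json regardless of the input order.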


Ambiguity policy

Resolution may produce multiple matching file type candidates. This is not considered a registry error.

This runtime resolution ambiguity policy is distinct from identifier ambiguity handling.

Identifier ambiguity occurs when a local identifier such as python resolves to multiple file types in the effective registry. In those situations, callers must use canonical qualified identifiers such as topmark:python.
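
The difference from path-based ambiguity can be illustrated with a small identifier-lookup sketch. resolve_identifier is a hypothetical helper, and raising exceptions for unknown or ambiguous identifiers is an assumption made for this example. This page only states that callers must switch to qualified identifiers in the ambiguous case.

```python
from typing import Iterable


def resolve_identifier(identifier: str, qualified_keys: Iterable[str]) -> str:
    """Map a local or qualified identifier to exactly one qualified key."""
    keys = sorted(set(qualified_keys))

    if ":" in identifier:
        # Already qualified: it must exist as-is.
        if identifier in keys:
            return identifier
        raise LookupError(f"unknown file type: {identifier}")

    # Local identifier: compare against the part after the namespace separator.
    matches = [k for k in keys if k.partition(":")[2] == identifier]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"unknown file type: {identifier}")
    raise ValueError(
        f"ambiguous identifier {identifier!r}; use one of: {', '.join(matches)}"
    )
```

Unlike path-based resolution, there is no tie-break here: an ambiguous local identifier is never silently resolved to one namespace.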

Overlap between file types is allowed because it enables useful patterns such as:

  • a generic built-in file type plus a more specific plugin-defined variant
  • a content-refined overlay type (for example, a JSON-like subtype over a generic JSON fallback)
  • shared extensions with different filename or content rules

TopMark's ambiguity policy is therefore:

  • multiple candidates are allowed during candidate generation
  • the resolver must return at most one effective winner
  • the winner is selected deterministically using the documented precedence and tie-break policy
  • ambiguity does not raise an exception in the stable 1.x resolution model

This keeps resolution stable and practical while still allowing rich, overlapping file type ecosystems.


Logging and observability

When multiple candidates share the top score, probe_resolution_for_path() records the tie-break outcome in the returned ResolutionProbeResult.

The probe result surfaces the full candidate set, scores, match signals, selected candidate, and reason (selected_highest_score or selected_by_tie_break). This makes resolution decisions fully observable without relying on debug logging alone.

For explicitly requested paths that were filtered before probing, observability is provided through synthetic probe results emitted by topmark probe, rather than through candidate-scoring data. The reported reason distinguishes whether the path was excluded by path filters, file-type filters, or a generic discovery-filter fallback.

This makes ambiguous-but-resolvable situations observable during development and debugging without turning them into hard failures.

Identifier normalization and file-type filter decisions are also observable through probe and runtime-resolution diagnostics.

The corresponding debug log entry for a tie-break includes:

  • the path being resolved
  • the shared top score
  • the qualified keys of the tied top candidates

This helps explain why a particular file type identity won when multiple strong candidates existed.
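
A tie-break log entry with these three fields could be produced along the following lines. This is a sketch using the standard logging module; the logger name, message format, and log_tie_break helper are hypothetical and do not reflect TopMark's actual log output.

```python
import logging
from typing import List, Mapping

logger = logging.getLogger("resolution_sketch")  # hypothetical logger name


def log_tie_break(path: str, scores: Mapping[str, int]) -> List[str]:
    """Log tied top candidates and return their qualified keys in sorted order."""
    if not scores:
        return []
    top = max(scores.values())
    tied = sorted(key for key, score in scores.items() if score == top)
    if len(tied) > 1:
        logger.debug(
            "tie-break for %s: top score %d shared by %s",
            path,
            top,
            ", ".join(tied),
        )
    return tied
```

A call such as log_tie_break("data/config.json", {"topmark:json": 100, "acme:json": 100, "topmark:text": 50}) returns ["acme:json", "topmark:json"] and emits one debug record.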


Design rationale

TopMark intentionally resolves ambiguity in the resolver layer rather than in the registries.

This separation keeps responsibilities clear:

  • FileTypeRegistry stores file type identities and canonical identifier resolution
  • HeaderProcessorRegistry stores processor identities
  • BindingRegistry stores effective file-type-to-processor relationships
  • the runtime resolver decides which file type best matches a concrete filesystem path

This design has several advantages:

  • registries remain simple and declarative
  • overlapping file types remain legal
  • runtime resolution remains deterministic and testable
  • plugin authors can define specialized file types without needing a separate override system in the registries

Non-goals

The current resolver deliberately does not provide:

  • user-configurable namespace priority
  • a strict ambiguity error mode
  • registry-time rejection of overlapping file type definitions
  • fuzzy matching or implicit namespace fallback for identifier resolution
  • pluggable custom precedence strategies

These may be introduced post-1.0 if there is a strong use case, but they are not part of the current TopMark stable 1.x runtime-resolution contract.


Possible future extensions

Possible future improvements include:

  • a strict mode that surfaces certain ambiguities as explicit resolution errors
  • user-configurable precedence overrides
  • richer diagnostics or hints when deterministic tie-breaks are used
  • plugin-defined precedence policies layered on top of the default scoring model
  • richer probe diagnostics and scoring transparency in machine-readable output

For the stable 1.x line, the documented deterministic policy on this page is the source of truth.


See also