File type resolution and ambiguity policy¶
This page documents how TopMark resolves a concrete filesystem path to the most specific matching
FileType, and then to the bound
HeaderProcessor registered for that file type.
Resolver behavior is deterministic and operates on canonical qualified file type identities such as
topmark:python.
Note
The canonical vocabulary used throughout the documentation is defined in Terminology and Canonical Vocabulary.
This page complements the registry architecture described in registry-model.md:
- registries define what exists
- the resolver defines what wins for a concrete path
This resolver operates within the broader TOML →
FrozenConfig → runtime architecture (see
architecture.md). It consumes the effective composed runtime registry state and
does not itself perform configuration discovery, layered configuration provenance export, or
strictness resolution for staged config-loading validation.
In particular, source-local TOML options such as [config].root and [config].strict are resolved
before runtime file-type resolution and probing begin. They influence discovery and staged
config-loading validation behavior, but are not part of the resolver's matching or tie-break logic.
Note
[config].strict is a TOML-source-local strictness preference controlling staged
configuration-loading validation for the current TOML source.
Effective strictness is evaluated across:
- TOML-source diagnostics;
- merged-config diagnostics;
- runtime applicability diagnostics.
strict is resolved during TOML loading and does not become a layered configuration field.
This distinction is also visible in
topmark config dump --show-layers: layered provenance exports
are produced earlier from resolved TOML sources and the flattened compatibility view, while
file-type resolution happens later against the already-validated effective runtime configuration.
Overview¶
TopMark has two different resolution modes:
- Identifier-based lookup resolves file types or processors from explicit local or qualified identifiers through the registries.
- Path-based resolution resolves a real path by evaluating extension, filename, pattern, and optional content-based signals.
Registry-facing APIs normalize identifiers to canonical qualified keys before resolver and binding operations.
Before path-based resolution runs, TopMark performs file discovery and filtering. Paths excluded at that stage do not participate in candidate generation or scoring.
File-type filters accept both local identifiers such as python and canonical qualified identifiers
such as topmark:python.
Resolver filtering operates on canonical qualified file type identities.
Path-based resolution is implemented in
topmark.resolution.filetypes and consumed by
ResolverStep.
The main public entry points are:
These entry points participate only in path-based runtime resolution and probing. They do not
surface or consume layered config provenance payloads such as the human-facing [[layers]] export
or the machine-readable config_provenance payload used by topmark config dump --show-layers.
They operate after staged config-loading validation has completed and the effective runtime configuration is finalized.
Note
Internal helper types such as PolicyOverrides and
ConfigOverrides are not part of the stable public
API surface. They are internal runtime orchestration helpers used by the CLI and public API
wrappers.
Public callers should pass plain mapping-based inputs through config=..., policy=..., and
policy_by_type=... instead of constructing these objects directly.
At this layer, path-based resolution consumes the already-finalized runtime configuration and effective composed runtime registry state.
Public API callers provide mapping-based inputs; internal typed runtime override objects are introduced earlier by CLI/API orchestration and are not part of the resolver contract.
See also:
- Architecture
- Registry model
- Pipelines (Concepts)
- Pipelines (Reference)
- Configuration discovery
- Configuration index
- Configuration
- Filtering
- CLI overview
Resolution pipeline boundaries¶
TopMark intentionally separates:
- discovery filtering
- runtime configuration resolution
- registry composition
- runtime file-type probing
- deterministic winner selection
- pipeline execution
Each stage operates on the finalized outputs of the previous stage.
This layered architecture keeps runtime resolution deterministic while preserving observability, stable machine-readable diagnostics, and explicit configuration/runtime boundaries.
Probe-based resolution (1.0 contract)¶
TopMark 1.0 exposes a probe-first resolution model via
probe_resolution_for_path().
Probe-based resolution operates only after discovery filtering and configuration normalization have completed.
This function returns a ResolutionProbeResult
containing:
- selected file type and processor (if any)
- probe status and reason
- all scored candidate file types
- match signals used during resolution
- filtered explicit inputs that did not reach file-type probing
The probe result is the canonical source of truth for runtime resolution decisions once a path
reaches file-type probing. Explicit inputs may be filtered earlier during discovery; those cases are
represented as synthetic probe results with status="filtered" and one of:
- reason="excluded_by_path_filter"
- reason="excluded_by_file_type_filter"
  This includes canonicalized file-type filtering using normalized qualified identifiers.
- reason="excluded_by_discovery_filter" (fallback when the exact category is not identified)

- ResolverStep consumes ctx.resolution_probe and maps it to pipeline state
- ProberStep exposes the same data for topmark probe
This unifies:
- human output (TEXT / Markdown)
- machine-readable output (JSON / NDJSON)
- pipeline runtime-resolution behavior
Callers should use
probe_resolution_for_path() when they
need path-based resolution details.
For stable integrations, prefer topmark.api.probe(), which returns
normalized public DTOs.
probe_resolution_for_path() is an
advanced helper that exposes internal runtime probe structures and is not part of the
topmark.api compatibility contract.
Note that probe_resolution_for_path()
only applies to paths that passed discovery filtering. The
topmark probe command augments these results with discovery-level
explanations for explicitly requested paths that were filtered before probing.
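As an illustration of how a consumer might branch on these outcomes, here is a minimal sketch. It assumes the probe result is available as a plain mapping with `path`, `status`, and `reason` keys; that shape, the `describe_probe` helper, and the explanation texts are hypothetical and not part of TopMark's API. Only the `status="filtered"` value and the three reason strings come from this page.

```python
from __future__ import annotations

from typing import Mapping

# Reason strings are taken from this page; the explanation texts are illustrative.
FILTER_EXPLANATIONS = {
    "excluded_by_path_filter": "excluded by a path filter",
    "excluded_by_file_type_filter": "excluded by a file-type filter",
    "excluded_by_discovery_filter": "excluded during discovery (exact category not identified)",
}


def describe_probe(result: Mapping[str, str | None]) -> str:
    """Summarize a probe result given as a plain mapping (hypothetical shape)."""
    path = result.get("path")
    status = result.get("status")
    reason = result.get("reason")
    if status == "filtered":
        # Synthetic result: the path never reached file-type probing.
        explanation = FILTER_EXPLANATIONS.get(reason or "", "filtered before probing")
        return f"{path}: not probed, {explanation}"
    # Any other status is reported as-is, without interpreting it.
    return f"{path}: status={status}, reason={reason}"
```

For stable integrations, the DTOs returned by topmark.api.probe() remain the reference; the mapping above is only a stand-in for them.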
Candidate generation¶
Candidate generation is performed by
get_file_type_candidates_for_path().
For each effective FileType, the resolver evaluates
name-based signals and, when allowed, optional content-based signals.
Candidate generation operates against the effective composed runtime registry.
Name-based signals¶
The resolver computes three name-based match signals:
- extension: the basename ends with one of the file type's configured extensions
- filename: the basename or normalized path tail matches one of the file type's configured filenames
- pattern: the basename fully matches one of the file type's configured regular-expression patterns
These signals are represented by MatchSignals.
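The three signals can be sketched as follows. This is a simplified illustration, not the resolver's implementation: `NameSignals` and `compute_name_signals` are hypothetical names standing in for MatchSignals and the internal matching code, and the path-tail comparison shown here is an assumption about how a normalized tail might be matched.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Sequence


@dataclass(frozen=True)
class NameSignals:
    """Hypothetical stand-in for the name-based part of MatchSignals."""

    extension: bool
    filename: bool
    pattern: bool


def compute_name_signals(
    path: str,
    extensions: Sequence[str],
    filenames: Sequence[str],
    patterns: Sequence[str],
) -> NameSignals:
    """Evaluate extension, filename/tail, and pattern matches for one file type."""
    posix = PurePosixPath(path).as_posix()
    basename = PurePosixPath(path).name
    # extension: the basename ends with a configured extension
    extension = any(basename.endswith(ext) for ext in extensions)
    # filename: the basename equals a configured name, or the path ends with a configured tail
    filename = any(
        basename == name or posix == name or posix.endswith("/" + name)
        for name in filenames
    )
    # pattern: the basename must match the whole regular expression
    pattern = any(re.fullmatch(p, basename) is not None for p in patterns)
    return NameSignals(extension=extension, filename=filename, pattern=pattern)
```

For example, `compute_name_signals("proj/.vscode/settings.json", [".json"], [".vscode/settings.json"], [])` sets both the extension and the filename signal, because the configured tail matches the end of the path.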
Content gating¶
Content probing is controlled by the file type's
ContentGate. This prevents unrelated files from being
probed unnecessarily and allows specialized overlay-style file types to refine generic matches.
Examples:
- ContentGate.NEVER disables content probing entirely
- ContentGate.IF_EXTENSION only allows probing when an extension matched
- ContentGate.IF_FILENAME only allows probing when a filename or tail matched
- ContentGate.IF_PATTERN only allows probing when a pattern matched
- ContentGate.IF_ANY_NAME_RULE allows probing when any name-based rule matched
- ContentGate.IF_NONE allows probing only when the file type declares no name-based rules
- ContentGate.ALWAYS allows content probing unconditionally
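The gate semantics listed above can be expressed as a small decision function. The sketch below uses a local `Gate` enum that mirrors the ContentGate member names; the enum itself, the `may_probe_content` helper, and its parameters are hypothetical and only illustrate the documented behavior.

```python
from enum import Enum, auto


class Gate(Enum):
    """Stand-in mirroring the ContentGate member names listed above."""

    NEVER = auto()
    IF_EXTENSION = auto()
    IF_FILENAME = auto()
    IF_PATTERN = auto()
    IF_ANY_NAME_RULE = auto()
    IF_NONE = auto()
    ALWAYS = auto()


def may_probe_content(
    gate: Gate,
    *,
    extension: bool = False,
    filename: bool = False,
    pattern: bool = False,
    has_name_rules: bool = True,
) -> bool:
    """Return True when the gate allows a content probe for the given name signals."""
    if gate is Gate.NEVER:
        return False
    if gate is Gate.ALWAYS:
        return True
    if gate is Gate.IF_EXTENSION:
        return extension
    if gate is Gate.IF_FILENAME:
        return filename
    if gate is Gate.IF_PATTERN:
        return pattern
    if gate is Gate.IF_ANY_NAME_RULE:
        return extension or filename or pattern
    if gate is Gate.IF_NONE:
        # Only file types that declare no name-based rules at all may probe.
        return not has_name_rules
    raise ValueError(f"unknown gate: {gate!r}")
```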
Candidate inclusion¶
A file type becomes a candidate when its evaluated signals satisfy the runtime resolver's inclusion rules.
This means that:
- a candidate may be included purely from name-based signals
- a candidate may be included only after a successful content probe
- a candidate may be excluded even when some name-based signals matched if the configured content gate requires a positive content hit
Candidate generation may therefore yield multiple file types for the same path. This is intentional and is handled by the deterministic selection policy described below.
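One way to read the three inclusion cases above is the following sketch. It is an interpretation under stated assumptions, not the resolver's actual rule: the `is_candidate` helper and the `requires_content_hit` flag are hypothetical, and the real inclusion logic may weigh signals differently.

```python
def is_candidate(
    *,
    name_matched: bool,
    content_hit: bool,
    requires_content_hit: bool,
) -> bool:
    """Decide inclusion from name signals and an optional content probe result."""
    if requires_content_hit:
        # Name matches alone are not enough; the content probe must confirm.
        return content_hit
    # Otherwise either kind of evidence is sufficient.
    return name_matched or content_hit
```

Under this reading, a file type with a matching extension is still excluded when its configuration demands a positive content hit and the probe found none.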
Scoring model¶
Each included candidate is assigned a precedence score by _score_file_type_candidate().
Higher scores are better.
The current precedence model is:
- explicit filename or filename-tail match
- content-confirmed match
- pattern match
- extension match
A small bonus is applied to file types that are not marked skip_processing=True, which gives
header-capable types a stable advantage on otherwise equal matches.
More specifically:
- filename and path-tail matches receive the highest scores and become more specific as the matched tail becomes longer
- content-confirmed matches outrank generic pattern and extension matches
- pattern matches outrank plain extension matches
- extension matches remain valid fallbacks for generic formats
The scoring model is intentionally biased toward the most specific match, while still keeping generic file types useful as fallbacks.
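To make the precedence order concrete, the sketch below assigns illustrative scores. The numeric weights, the `score_candidate` helper, and the choice to score only the strongest matched signal are all assumptions made for this example; the actual values used by _score_file_type_candidate() are not documented on this page.

```python
def score_candidate(
    *,
    filename_tail_segments: int = 0,
    content_confirmed: bool = False,
    pattern: bool = False,
    extension: bool = False,
    skip_processing: bool = False,
) -> int:
    """Return an illustrative precedence score (higher is better)."""
    if filename_tail_segments > 0:
        # Filename and path-tail matches rank highest; longer tails are more specific.
        score = 1000 + 10 * filename_tail_segments
    elif content_confirmed:
        score = 300
    elif pattern:
        score = 200
    elif extension:
        score = 100
    else:
        return 0  # no signal matched
    if not skip_processing:
        # Small bonus so header-capable types win otherwise equal matches.
        score += 1
    return score
```

With these placeholder weights, a two-segment tail match outranks a one-segment one, which outranks a content-confirmed match, then a pattern match, then a plain extension match.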
Deterministic selection¶
Final selection is handled by _select_best_file_type_candidate().
TopMark does not treat multiple candidates as an error. Instead, it applies a deterministic
ordering key defined by
candidate_order_key().
Candidates are ordered by:
- score (descending)
- namespace (ascending)
- local key (ascending)
The winning file type is therefore stable for a given:
- effective composed runtime registry state
- path and filename
- file content
- configuration and filtering state
In practice, this means:
- the highest-scoring candidate wins
- if multiple candidates have the same score, namespace is used as the first stable tie-breaker
- if score and namespace are equal, local key is used as the final stable tie-breaker
This policy guarantees that the same path, content, and effective registry state always produce the
same winning FileType.
TopMark uses a deterministic winner-selection policy rather than an ambiguity-error policy.
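The ordering described in this section can be sketched with a plain sort key. The `Candidate` tuple and the `order_key` and `select_best` helpers are hypothetical stand-ins for the internal candidate type, candidate_order_key(), and _select_best_file_type_candidate(); only the ordering criteria (score descending, then namespace, then local key) come from this page.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for a scored file type candidate."""

    namespace: str
    local_key: str
    score: int


def order_key(candidate: Candidate) -> Tuple[int, str, str]:
    """Sort key: score descending, then namespace and local key ascending."""
    return (-candidate.score, candidate.namespace, candidate.local_key)


def select_best(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the single winner, or None when there are no candidates."""
    if not candidates:
        return None
    return min(candidates, key=order_key)
```

Because the key is total over (score, namespace, local key), the winner does not depend on the order in which candidates were generated.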
Ambiguity policy¶
Resolution may produce multiple matching file type candidates. This is not considered a registry error.
This runtime resolution ambiguity policy is distinct from identifier ambiguity handling.
Identifier ambiguity occurs when a local identifier such as python resolves to multiple file types
in the effective registry. In those situations, callers must use canonical qualified identifiers
such as topmark:python.
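Identifier ambiguity can be illustrated with a small lookup over qualified keys. The sketch assumes that an ambiguous local identifier is rejected with an error; the `resolve_identifier` helper, the `AmbiguousIdentifierError` class, and the `acme` namespace are hypothetical and do not describe TopMark's registry API.

```python
from __future__ import annotations


class AmbiguousIdentifierError(LookupError):
    """Hypothetical error for a local identifier that matches several file types."""


def resolve_identifier(identifier: str, registry_keys: set[str]) -> str:
    """Return the canonical qualified key for a local or qualified identifier."""
    if ":" in identifier:
        # Already qualified: it must exist verbatim in the registry.
        if identifier not in registry_keys:
            raise KeyError(identifier)
        return identifier
    # Local identifier: collect every qualified key with this local part.
    matches = sorted(k for k in registry_keys if k.split(":", 1)[1] == identifier)
    if not matches:
        raise KeyError(identifier)
    if len(matches) > 1:
        raise AmbiguousIdentifierError(
            f"{identifier!r} matches {matches}; use a qualified identifier"
        )
    return matches[0]
```

With keys `{"topmark:python", "acme:python", "topmark:json"}`, the local identifier `json` resolves to `topmark:json`, while `python` is ambiguous and has to be written as `topmark:python` or `acme:python`.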
Overlap between file types is allowed because it enables useful patterns such as:
- a generic built-in file type plus a more specific plugin-defined variant
- a content-refined overlay type (for example, a JSON-like subtype over a generic JSON fallback)
- shared extensions with different filename or content rules
TopMark's ambiguity policy is therefore:
- multiple candidates are allowed during candidate generation
- the resolver must return at most one effective winner
- the winner is selected deterministically using the documented precedence and tie-break policy
- ambiguity does not raise an exception in the stable 1.x resolution model
This keeps resolution stable and practical while still allowing rich, overlapping file type ecosystems.
Logging and observability¶
When multiple candidates share the top score,
probe_resolution_for_path() records
the tie-break outcome in the returned
ResolutionProbeResult.
The probe result surfaces the full candidate set, scores, match signals, selected candidate, and
reason (selected_highest_score or selected_by_tie_break). This makes resolution decisions fully
observable without relying on debug logging alone.
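The two selection reasons can be derived from the candidate scores alone, as in the sketch below. The `explain_selection` helper and its input shape (a mapping from qualified key to score) are hypothetical; the reason strings are the ones named above.

```python
from __future__ import annotations


def explain_selection(scores: dict[str, int]) -> tuple[str | None, str | None, list[str]]:
    """Return (winner, reason, tied_keys) for candidates keyed by qualified identifier."""
    if not scores:
        return None, None, []
    top = max(scores.values())
    # Order the top-scoring keys by (namespace, local key), the documented tie-break order.
    tied = sorted(
        (key for key, score in scores.items() if score == top),
        key=lambda key: tuple(key.split(":", 1)),
    )
    reason = "selected_by_tie_break" if len(tied) > 1 else "selected_highest_score"
    return tied[0], reason, tied
```

Returning the tied keys alongside the winner keeps the decision explainable without consulting debug logs.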
For explicitly requested paths that were filtered before probing, observability is provided through
synthetic probe results emitted by topmark probe, rather than
through candidate-scoring data. The reported reason distinguishes whether the path was excluded by
path filters, file-type filters, or a generic discovery-filter fallback.
This makes ambiguous-but-resolvable situations observable during development and debugging without turning them into hard failures.
Identifier normalization and file-type filter decisions are also observable through probe and runtime-resolution diagnostics.
When a tie-break occurs, the log entry includes:
- the path being resolved
- the shared top score
- the qualified keys of the tied top candidates
This helps explain why a particular file type identity won when multiple strong candidates existed.
Design rationale¶
TopMark intentionally resolves ambiguity in the resolver layer rather than in the registries.
This separation keeps responsibilities clear:
- FileTypeRegistry stores file type identities and canonical identifier resolution
- HeaderProcessorRegistry stores processor identities
- BindingRegistry stores effective file-type-to-processor relationships
- the runtime resolver decides which file type best matches a concrete filesystem path
This design has several advantages:
- registries remain simple and declarative
- overlapping file types remain legal
- runtime resolution remains deterministic and testable
- plugin authors can define specialized file types without needing a separate override system in the registries
Non-goals¶
The current resolver deliberately does not provide:
- user-configurable namespace priority
- a strict ambiguity error mode
- registry-time rejection of overlapping file type definitions
- fuzzy matching or implicit namespace fallback for identifier resolution
- pluggable custom precedence strategies
These may be introduced post-1.0 if there is a strong use case, but they are not part of the current TopMark stable 1.x runtime-resolution contract.
Possible future extensions¶
Possible future improvements include:
- a strict mode that surfaces certain ambiguities as explicit resolution errors
- user-configurable precedence overrides
- richer diagnostics or hints when deterministic tie-breaks are used
- plugin-defined precedence policies layered on top of the default scoring model
- richer probe diagnostics and scoring transparency in machine-readable output
For the stable 1.x line, the documented deterministic policy on this page is the source of truth.
See also¶
- Architecture: registry design and system overview
- Registry model: registry layers, bindings, overlays, and identifier semantics
- Terminology and Canonical Vocabulary: canonical definitions for identifiers, applicability, ambiguity, and machine-readable terminology
- Plugins: how file types and processors are registered
- Machine-readable output: how resolution results surface in JSON and NDJSON outputs
- Configuration: canonical file-type identifier semantics
- Filtering: discovery and file-type filter behavior
- CLI overview: resolver-related CLI commands and filtering options