topmark.filetypes.model¶
File type definitions and matching behavior for TopMark.
This module defines FileType, the value object used by the registry and
resolver to recognize files and decide whether they are eligible for TopMark
header processing.
A file type can match files by extension, filename rule, regular expression, and
optional content probing. Filename rules are declarative registry matching rules,
not filesystem paths. They are normalized and validated when a FileType is
constructed so matching, registry output, and machine-readable serialization use
the same canonical representation on every platform.
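To illustrate what normalizing a filename rule can look like, here is a minimal sketch. The helper name `normalize_filename_rule` and the specific validation checks are assumptions made for this example; the text above only states that backslashes are accepted and that rules are normalized and validated at construction time.

```python
def normalize_filename_rule(rule: str) -> str:
    """Return a canonical POSIX-style form of a filename rule (hypothetical helper)."""
    # Backslashes are accepted as compatibility input and become "/".
    normalized = rule.strip().replace("\\", "/")
    # Assumed validation: a rule is a relative identifier, so reject empty
    # rules, leading or trailing slashes, and "." or ".." segments.
    segments = normalized.split("/")
    if not normalized or any(seg in ("", ".", "..") for seg in segments):
        raise ValueError(f"invalid filename rule: {rule!r}")
    return normalized
```

With this sketch, a rule written with Windows-style separators and the same rule written with forward slashes produce the same stored string, which is the property the canonical representation is meant to guarantee.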
ContentGate ¶
Bases: Enum
Policy that controls when a FileType.content_matcher may run.
Use a gate to prevent accidental matches (e.g., Markdown containing //).
Most overlay types like JSON-with-comments should use IF_EXTENSION so that
content probing only occurs when the file already looks like the family by
extension.
Attributes:
| Name | Type | Description |
|---|---|---|
| NEVER | | Never evaluate the content matcher. |
| IF_EXTENSION | | Probe content only if an extension matched. |
| IF_FILENAME | | Probe content only if a filename/tail matched. |
| IF_PATTERN | | Probe content only if a regex pattern matched. |
| IF_ANY_NAME_RULE | | Probe if any name rule matched (extension OR filename OR pattern). |
| IF_NONE | | Probe only when the type has no name rules declared (pure content types). |
| ALWAYS | | Always evaluate the content matcher (use sparingly). |
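The gate semantics in the table can be read as a small predicate over which name rules matched. The sketch below uses a stand-in enum `Gate` and a hypothetical function `may_probe`; it mirrors the member names and descriptions above but is not TopMark's implementation.

```python
from enum import Enum, auto


class Gate(Enum):
    """Stand-in for ContentGate, using the member names documented above."""

    NEVER = auto()
    IF_EXTENSION = auto()
    IF_FILENAME = auto()
    IF_PATTERN = auto()
    IF_ANY_NAME_RULE = auto()
    IF_NONE = auto()
    ALWAYS = auto()


def may_probe(
    gate: Gate,
    *,
    ext_hit: bool,
    filename_hit: bool,
    pattern_hit: bool,
    has_name_rules: bool,
) -> bool:
    """Decide whether the content matcher may run (hypothetical helper)."""
    if gate is Gate.NEVER:
        return False
    if gate is Gate.ALWAYS:
        return True
    if gate is Gate.IF_EXTENSION:
        return ext_hit
    if gate is Gate.IF_FILENAME:
        return filename_hit
    if gate is Gate.IF_PATTERN:
        return pattern_hit
    if gate is Gate.IF_ANY_NAME_RULE:
        return ext_hit or filename_hit or pattern_hit
    # Gate.IF_NONE: only pure content types (no name rules declared) probe.
    return not has_name_rules
```

For a JSON-with-comments overlay type gated by `IF_EXTENSION`, a Markdown file containing `//` never reaches the content matcher because its extension did not match.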
ContentMatcher ¶
Bases: Protocol
Protocol for content-based file type matchers.
A content matcher is a callable that inspects a file's contents to determine if it matches a specific file type. This is useful for file types that cannot be reliably identified by name alone. The matcher should be fast, side-effect free, and return True if the file is of the expected type.
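A minimal sketch of a callable that fits this description follows. The protocol name `ContentMatcherLike` and its `(path: Path) -> bool` call signature are assumptions inferred from the CJSON example later on this page, and `has_python_shebang` is a hypothetical matcher written for illustration.

```python
from pathlib import Path
from typing import Protocol


class ContentMatcherLike(Protocol):
    """Assumed shape of a content matcher: a callable from Path to bool."""

    def __call__(self, path: Path) -> bool: ...


def has_python_shebang(path: Path) -> bool:
    """Return True if the file starts with a shebang line that mentions python."""
    try:
        with path.open("rb") as handle:
            # Read at most 256 bytes of the first line: fast and side-effect free.
            first_line = handle.readline(256)
    except OSError:
        # Unreadable files are treated as non-matches.
        return False
    return first_line.startswith(b"#!") and b"python" in first_line


# A plain function satisfies the protocol structurally.
matcher: ContentMatcherLike = has_python_shebang
```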
InsertCapability ¶
Bases: Enum
Advisory on whether a header insertion is advisable in the current context.
Attributes:
| Name | Type | Description |
|---|---|---|
| UNEVALUATED | | No checker result yet. |
| OK | | Insertion is advisable. |
| SKIP_UNSUPPORTED_CONTENT | | Insertion should be skipped because the file content is not suitable (e.g., XML prolog-only files). |
| SKIP_POLICY | | Insertion should be skipped due to policy (e.g., file type configured to skip processing). |
| SKIP_READONLY | | Insertion should be skipped because the file is read-only (future use; not implemented yet). |
| SKIP_IDEMPOTENCE_RISK | | Skip because we cannot guarantee insert→strip idempotence (e.g., insertion would reflow a physical line or introduce ambiguous blank-line padding). |
| SKIP_OTHER | | Insertion should be skipped for other reasons (e.g., pre-insert checks failed). |
InsertCheckResult ¶
Bases: TypedDict
Result of a pre-insert check.
Attributes:
| Name | Type | Description |
|---|---|---|
| capability | InsertCapability | Advisory on whether insertion is OK or should be skipped (and why). |
| reason | str | Human-readable explanation for the advisory. |
| origin | str | Origin of the result |
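As a rough illustration of the three keys, the sketch below builds a result and turns it into a log-style message. `Capability` and `CheckResult` are stand-ins defined locally for this example (the enum values and the `describe` helper are assumptions, not part of TopMark).

```python
from enum import Enum
from typing import TypedDict


class Capability(Enum):
    """Stand-in for InsertCapability with two of the documented members."""

    OK = "ok"
    SKIP_POLICY = "skip_policy"


class CheckResult(TypedDict):
    """Stand-in for InsertCheckResult with the documented keys."""

    capability: Capability
    reason: str
    origin: str


def describe(result: CheckResult) -> str:
    """Summarize a check result in one line (hypothetical helper)."""
    verdict = "insert" if result["capability"] is Capability.OK else "skip"
    return f"{verdict}: {result['reason']} (from {result['origin']})"


result: CheckResult = {
    "capability": Capability.SKIP_POLICY,
    "reason": "file type is configured to skip processing",
    "origin": "example_checker",
}
```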
PreInsertHeaderProcessorView ¶
Bases: Protocol
Read-only header-processor surface needed by pre-insert checkers.
get_header_insertion_char_offset ¶
PreInsertContextView ¶
Bases: Protocol
Minimal view of ProcessingContext for pre-insert checkers.
This protocol defines the minimal set of attributes that a ProcessingContext must have to be used by pre-insert checkers. It allows checkers to be defined without depending on the full ProcessingContext class.
Attributes are exposed as read-only properties so pre-insert checkers can inspect context state without mutating pipeline-owned data.
InsertChecker ¶
Bases: Protocol
Protocol for pre-insert checkers associated with a FileType.
A pre-insert checker is a callable that inspects the current processing context before a header insertion is attempted. It can advise whether insertion is advisable, should be skipped, or is outright disallowed. The checker receives a minimal view of the ProcessingContext to avoid unnecessary dependencies.
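The following sketch shows how a checker can depend on a narrow, read-only view rather than on the full context class. Everything named here is hypothetical: the `file_lines` property, the `ContextView` protocol, and the use of plain strings in place of `InsertCapability` members are assumptions chosen so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Protocol


class ContextView(Protocol):
    """Assumed minimal view; the real protocol's attributes may differ."""

    @property
    def file_lines(self) -> list[str]: ...


def xml_prolog_only_checker(ctx: ContextView) -> dict[str, str]:
    """Advise skipping files that contain nothing but an XML prolog."""
    body = [line for line in ctx.file_lines if line.strip()]
    prolog_only = len(body) == 1 and body[0].lstrip().startswith("<?xml")
    if prolog_only:
        return {
            # Capability names are given as strings here for simplicity.
            "capability": "SKIP_UNSUPPORTED_CONTENT",
            "reason": "file contains only an XML prolog",
            "origin": "xml_prolog_only_checker",
        }
    return {
        "capability": "OK",
        "reason": "no blocking condition found",
        "origin": "xml_prolog_only_checker",
    }


@dataclass(frozen=True)
class FakeContext:
    """Test double that satisfies the view structurally."""

    file_lines: list[str]
```

Because the checker only reads `file_lines`, a small test double such as `FakeContext` is enough to exercise it without constructing a full processing pipeline.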
FileType dataclass ¶
FileType(
*,
local_key,
namespace,
extensions,
filenames,
patterns,
description,
skip_processing=False,
content_matcher=None,
content_gate=ContentGate.NEVER,
header_policy=FileTypeHeaderPolicy(),
pre_insert_checker=None,
_compiled_patterns=None,
)
Represents a file type recognized by TopMark.
A file type describes how TopMark recognizes files on disk and whether they
are eligible for header processing. Recognition can be based on filename
extension, exact filename, regex pattern, and optionally file content via
content_matcher.
Attributes:
| Name | Type | Description |
|---|---|---|
| local_key | str | Internal identifier of the file type (e.g. |
| namespace | str | FileType namespace. |
| extensions | list[str] | List of filename extensions associated with this type. Values should include the leading dot (e.g. |
| filenames | list[str] | Exact basename or relative tail-subpath matching rules. Rules are registry identifiers, not filesystem paths. Backslashes are accepted as compatibility input and normalized to POSIX-style |
| patterns | list[str] | Regular expressions evaluated against the basename only with |
| description | str | Human-readable description of the file type. |
| skip_processing | bool | When |
| content_matcher | ContentMatcher \| None | Optional content matcher that performs content-based recognition when name-based heuristics are ambiguous. TopMark calls this last in |
| content_gate | ContentGate | Gate that controls when the content matcher is consulted. |
| header_policy | FileTypeHeaderPolicy | Optional |
| pre_insert_checker | InsertChecker \| None | Optional pre-insert checker: "may we add a TopMark header here?" |
Content-based recognition (example)¶
A practical use case is differentiating commented JSON (CJSON) from plain
JSON. A file named config.json might actually be CJSON, which supports line
comments and can therefore carry TopMark headers. You can provide a CJSON
file type with a content_matcher that inspects the file for comment tokens
(e.g. // or /* ... */) while avoiding naïve false positives:
```python
from pathlib import Path


def looks_like_cjson(path: Path) -> bool:
    try:
        text = path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        return False
    # Heuristic (simple check): look for // or /* comment tokens.
    # This does not verify that the tokens sit outside string literals.
    return "//" in text or "/*" in text
```
Registering such a file type makes TopMark recognize these files; pairing it with a suitable header processor makes them supported for processing.
Notes
- matches first tries extensions, filenames, and regex patterns. Only if those fail and content_matcher is set will it call the matcher to decide.
- filenames entries are normalized and validated during construction. Tail-subpath rules are matched against path.as_posix(), but the rules themselves are stored canonically and should be serialized as POSIX-style strings.
- Content matchers should read a small portion of the file where possible to remain fast on large trees. The current implementation leaves that policy to the callable to keep the base class simple.
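One way to follow the advice about reading only a small portion of the file is to cap the number of bytes inspected. The sketch below is a hypothetical variant of the CJSON matcher; the function name and the 4096-byte default are assumptions, and the check is still a simple heuristic (for example, a URL containing `//` inside a string would also match).

```python
from pathlib import Path


def looks_like_cjson_bounded(path: Path, max_bytes: int = 4096) -> bool:
    """Look for comment tokens in at most max_bytes from the start of the file."""
    try:
        with path.open("rb") as handle:
            # Only the head of the file is read, so large files stay cheap.
            head = handle.read(max_bytes)
    except OSError:
        return False
    return b"//" in head or b"/*" in head
```

The trade-off is that a comment appearing only after the first `max_bytes` bytes is missed, which is usually acceptable when comments tend to appear near the top of configuration files.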
qualified_key property ¶
Return the qualified identity key for this file type instance.
Format: "<namespace>:<local_key>".
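The format can be illustrated with a small stand-in class. `KeyedType` is hypothetical and only reproduces the two fields involved; the namespace and key values in the usage line are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyedType:
    """Stand-in with only the fields that make up the qualified key."""

    namespace: str
    local_key: str

    @property
    def qualified_key(self) -> str:
        # Join namespace and local key with a colon.
        return f"{self.namespace}:{self.local_key}"


example = KeyedType(namespace="example", local_key="cjson")
```

Here `example.qualified_key` evaluates to `"example:cjson"`.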
matches ¶
Determine whether this file type matches a path.
Matching is attempted in deterministic order: extensions, filename
rules, regex patterns, and then an optional content matcher if permitted
by content_gate. Exact basename filename rules compare against
path.name; relative tail-subpath filename rules compare against
path.as_posix().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | Path to test against this file type. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the path matches this file type, False otherwise. Content matcher exceptions are caught and treated as non-matches. |
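The matching order described above can be sketched as a standalone function. This is a simplified illustration, not the method's source: `sketch_matches` is a hypothetical name, the use of `str.endswith` for extensions and `re.search` for patterns are assumptions, and `content_gate` handling is omitted, so the content matcher here acts purely as a fallback.

```python
import re
from pathlib import Path
from typing import Callable, Optional, Sequence


def sketch_matches(
    path: Path,
    *,
    extensions: Sequence[str] = (),
    filenames: Sequence[str] = (),
    patterns: Sequence[str] = (),
    content_matcher: Optional[Callable[[Path], bool]] = None,
) -> bool:
    name = path.name
    posix = path.as_posix()
    # 1. Extensions (assumed: compared against the end of the basename).
    if any(name.endswith(ext) for ext in extensions):
        return True
    # 2. Filename rules: tail-subpath rules use the POSIX path, others the basename.
    for rule in filenames:
        if "/" in rule:
            if posix == rule or posix.endswith("/" + rule):
                return True
        elif name == rule:
            return True
    # 3. Regex patterns against the basename only (assumed: re.search).
    if any(re.search(pattern, name) for pattern in patterns):
        return True
    # 4. Optional content matcher; exceptions count as non-matches.
    if content_matcher is not None:
        try:
            return bool(content_matcher(path))
        except Exception:
            return False
    return False
```

For example, the rule `.vscode/settings.json` matches `proj/.vscode/settings.json` through the tail-subpath branch but does not match `other/settings.json`, because the rule contains a separator and is therefore never compared against the basename alone.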
Source code in src/topmark/filetypes/model.py