topmark.filetypes.model¶

File type definitions and matching behavior for TopMark.

This module defines FileType, the value object used by the registry and resolver to recognize files and decide whether they are eligible for TopMark header processing.

A file type can match files by extension, filename rule, regular expression, and optional content probing. Filename rules are declarative registry matching rules, not filesystem paths. They are normalized and validated when a FileType is constructed so matching, registry output, and machine-readable serialization use the same canonical representation on every platform.

ContentGate ¶

Bases: Enum

Policy that controls when a FileType.content_matcher may run.

Use a gate to prevent accidental matches (e.g., Markdown containing //). Most overlay types like JSON-with-comments should use IF_EXTENSION so that content probing only occurs when the file already looks like the family by extension.

Attributes:

Name	Type	Description
`NEVER`		Never evaluate the content matcher.
`IF_EXTENSION`		Probe content only if an extension matched.
`IF_FILENAME`		Probe content only if a filename/tail matched.
`IF_PATTERN`		Probe content only if a regex pattern matched.
`IF_ANY_NAME_RULE`		Probe if any name rule matched (extension OR filename OR pattern).
`IF_NONE`		Probe only when the type has no name rules declared (pure content types).
`ALWAYS`		Always evaluate the content matcher (use sparingly).
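
The gate semantics in the table above can be sketched as a small standalone function. This is a minimal illustration, not TopMark's code: the `ContentGate` enum here is a local stand-in with the documented member names, and `gate_allows`, `matched_by`, and `has_name_rules` are hypothetical names chosen for the example.

```python
from enum import Enum, auto
from typing import Optional


class ContentGate(Enum):
    """Local stand-in mirroring the documented members (not imported from TopMark)."""

    NEVER = auto()
    IF_EXTENSION = auto()
    IF_FILENAME = auto()
    IF_PATTERN = auto()
    IF_ANY_NAME_RULE = auto()
    IF_NONE = auto()
    ALWAYS = auto()


def gate_allows(gate: ContentGate, matched_by: Optional[str], has_name_rules: bool) -> bool:
    """Return True if content probing may run (hypothetical helper).

    `matched_by` is "extension", "filename", "pattern", or None.
    `has_name_rules` says whether the type declares any name rule at all.
    """
    if gate is ContentGate.ALWAYS:
        return True
    if gate is ContentGate.IF_EXTENSION:
        return matched_by == "extension"
    if gate is ContentGate.IF_FILENAME:
        return matched_by == "filename"
    if gate is ContentGate.IF_PATTERN:
        return matched_by == "pattern"
    if gate is ContentGate.IF_ANY_NAME_RULE:
        return matched_by is not None
    if gate is ContentGate.IF_NONE:
        return not has_name_rules
    return False  # NEVER, or any unrecognized gate


# A JSON-with-comments overlay would typically probe only after an extension match.
print(gate_allows(ContentGate.IF_EXTENSION, "extension", True))
```

Under this reading, a type using IF_EXTENSION never reads file contents for paths whose extension did not match, which is what keeps a Markdown file containing // from being probed.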

ContentMatcher ¶

Bases: Protocol

Protocol for content-based file type matchers.

A content matcher is a callable that inspects a file's contents to determine if it matches a specific file type. This is useful for file types that cannot be reliably identified by name alone. The matcher should be fast, side-effect free, and return True if the file is of the expected type.
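
As an illustration of the shape such a callable might take, the sketch below assumes the protocol is a single `__call__(path) -> bool`, which is inferred from how the matcher is described here rather than copied from the TopMark source. The matcher `starts_with_shebang` and its `max_bytes` parameter are hypothetical; the point is that it reads only a small prefix and never raises.

```python
from pathlib import Path
from typing import Protocol


class ContentMatcher(Protocol):
    """Assumed protocol shape: called with a path, returns a bool."""

    def __call__(self, path: Path) -> bool: ...


def starts_with_shebang(path: Path, max_bytes: int = 256) -> bool:
    """Hypothetical matcher: inspect only the first bytes and never raise."""
    try:
        with path.open("rb") as handle:
            head = handle.read(max_bytes)
    except OSError:
        # Unreadable or missing files are simply treated as non-matches.
        return False
    return head.startswith(b"#!")


# A plain function satisfies the protocol structurally; no subclassing is needed.
matcher: ContentMatcher = starts_with_shebang
```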

InsertCapability ¶

Bases: Enum

Advisory on whether a header insertion is advisable in the current context.

Attributes:

Name	Type	Description
`UNEVALUATED`		No checker result yet.
`OK`		Insertion is advisable.
`SKIP_UNSUPPORTED_CONTENT`		Insertion should be skipped because the file content is not suitable (e.g., XML prolog-only files).
`SKIP_POLICY`		Insertion should be skipped due to policy (e.g., file type configured to skip processing).
`SKIP_READONLY`		Insertion should be skipped because the file is read-only (future use; not implemented yet).
`SKIP_IDEMPOTENCE_RISK`		Skip because we cannot guarantee insert→strip idempotence (e.g., insertion would reflow a physical line or introduce ambiguous blank-line padding).
`SKIP_OTHER`		Insertion should be skipped for other reasons (e.g., pre-insert checks failed).

InsertCheckResult ¶

Bases: TypedDict

Result of a pre-insert check.

Attributes:

Name	Type	Description
`capability`	`InsertCapability`	Advisory on whether insertion is OK or should be skipped (and why).
`reason`	`str`	Human-readable explanation for the advisory.
`origin`	`str`	Origin of the result.
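
A result with these three keys might be built and consumed as sketched below. Both classes are local stand-ins that reuse the documented names; whether the real TypedDict is total, and what values `origin` normally carries, are not stated here, so the `"example_checker"` label and the `should_insert` helper are purely hypothetical.

```python
from enum import Enum, auto
from typing import TypedDict


class InsertCapability(Enum):
    """Local stand-in with the documented member names."""

    UNEVALUATED = auto()
    OK = auto()
    SKIP_UNSUPPORTED_CONTENT = auto()
    SKIP_POLICY = auto()
    SKIP_READONLY = auto()
    SKIP_IDEMPOTENCE_RISK = auto()
    SKIP_OTHER = auto()


class InsertCheckResult(TypedDict):
    """Local stand-in with the three documented keys."""

    capability: InsertCapability
    reason: str
    origin: str


def should_insert(result: InsertCheckResult) -> bool:
    """Hypothetical consumer: only an explicit OK permits insertion."""
    return result["capability"] is InsertCapability.OK


result: InsertCheckResult = {
    "capability": InsertCapability.SKIP_POLICY,
    "reason": "file type is configured to skip processing",
    "origin": "example_checker",  # hypothetical origin label
}
print(should_insert(result))
```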

PreInsertHeaderProcessorView ¶

Bases: Protocol

Read-only header-processor surface needed by pre-insert checkers.

get_header_insertion_char_offset ¶

get_header_insertion_char_offset(original_text)

Return the character insertion offset for positional formats, if any.

Source code in src/topmark/filetypes/model.py

def get_header_insertion_char_offset(self, original_text: str) -> int | None:
    """Return the character insertion offset for positional formats, if any."""
    ...

PreInsertContextView ¶

Bases: Protocol

Minimal view of ProcessingContext for pre-insert checkers.

This protocol defines the minimal set of attributes that a ProcessingContext must have to be used by pre-insert checkers. It allows checkers to be defined without depending on the full ProcessingContext class.

Attributes are exposed as read-only properties so pre-insert checkers can inspect context state without mutating pipeline-owned data.

lines `property` ¶

lines

Streaming access to the file image lines.

newline_style `property` ¶

newline_style

Detected newline style used for insertion decisions.

header_processor `property` ¶

header_processor

Resolved header processor, if one is available.

file_type `property` ¶

file_type

Resolved file type, if one is available.

InsertChecker ¶

Bases: Protocol

Protocol for pre-insert checkers associated with a FileType.

A pre-insert checker is a callable that inspects the current processing context before a header insertion is attempted. It can advise whether insertion is advisable, should be skipped, or is outright disallowed. The checker receives a minimal view of the ProcessingContext to avoid unnecessary dependencies.
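
A checker along these lines could look like the sketch below, which flags files that contain nothing but an XML prolog. The call signature (one context argument, returning an InsertCheckResult-shaped dict) is an assumption; `FakeContext` is a hypothetical stand-in exposing only `lines`, and the enum is trimmed to the two members the example needs.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterable, TypedDict


class InsertCapability(Enum):
    """Trimmed local stand-in; only the members this sketch needs."""

    OK = auto()
    SKIP_UNSUPPORTED_CONTENT = auto()


class InsertCheckResult(TypedDict):
    capability: InsertCapability
    reason: str
    origin: str


@dataclass(frozen=True)
class FakeContext:
    """Hypothetical stand-in exposing only `lines` from the context view."""

    lines: Iterable[str]


def xml_prolog_only_checker(ctx: FakeContext) -> InsertCheckResult:
    """Advise skipping when the file holds nothing but an XML prolog."""
    # Ignore blank lines; only inspect lines that carry content.
    meaningful = [line.strip() for line in ctx.lines if line.strip()]
    prolog_only = bool(meaningful) and all(
        line.startswith("<?xml") and line.endswith("?>") for line in meaningful
    )
    if prolog_only:
        return {
            "capability": InsertCapability.SKIP_UNSUPPORTED_CONTENT,
            "reason": "file contains only an XML prolog",
            "origin": "xml_prolog_only_checker",
        }
    return {
        "capability": InsertCapability.OK,
        "reason": "no blocking condition found",
        "origin": "xml_prolog_only_checker",
    }
```

The checker only reads from the context and returns advice; it does not modify the lines it is given, in keeping with the read-only view described above.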

FileType `dataclass` ¶

FileType(
    *,
    local_key,
    namespace,
    extensions,
    filenames,
    patterns,
    description,
    skip_processing=False,
    content_matcher=None,
    content_gate=ContentGate.NEVER,
    header_policy=FileTypeHeaderPolicy(),
    pre_insert_checker=None,
    _compiled_patterns=None,
)

Represents a file type recognized by TopMark.

A file type describes how TopMark recognizes files on disk and whether they are eligible for header processing. Recognition can be based on filename extension, exact filename, regex pattern, and optionally file content via content_matcher.

Attributes:

Name	Type	Description
`local_key`	`str`	Internal identifier of the file type (e.g. `"python"`).
`namespace`	`str`	FileType namespace.
`extensions`	`list[str]`	List of filename extensions associated with this type. Values should include the leading dot (e.g. `.py`): `matches` compares single-dot entries with `path.suffix` and multi-dot entries (e.g. `.tar.gz`) with the end of the basename.
`filenames`	`list[str]`	Exact basename or relative tail-subpath matching rules. Rules are registry identifiers, not filesystem paths. Backslashes are accepted as compatibility input and normalized to POSIX-style `/` separators during construction. Stored values are always canonical POSIX-style strings. Rules must not be empty, absolute, UNC-like, Windows drive paths, or contain empty, `.` or `..` segments.
`patterns`	`list[str]`	Regular expressions evaluated against the basename only with `re.fullmatch`. Patterns are not path-aware and do not receive POSIX-normalized subpaths; use `filenames` for exact basename or relative tail-subpath rules.
`description`	`str`	Human-readable description of the file type.
`skip_processing`	`bool`	When `True`, the pipeline recognizes files of this type but intentionally skips header processing (e.g. JSON without comments, LICENSE files). This lets discovery work while keeping writes disabled by design.
`content_matcher`	`ContentMatcher \| None`	Optional content matcher that performs content-based recognition when name-based heuristics are ambiguous. `matches` consults it last, after testing extensions, filenames, and patterns, and only when `content_gate` permits probing. The callable should be fast, side-effect free, and return `True` if the file is of this type. It should not raise; any exception is caught and treated as a non-match.
`content_gate`	`ContentGate`	Gate that controls when the content matcher is consulted.
`header_policy`	`FileTypeHeaderPolicy`	Optional `FileTypeHeaderPolicy` that tunes placement (e.g., shebang handling) and scanning windows around the expected insertion anchor.
`pre_insert_checker`	`InsertChecker \| None`	Optional pre-insert checker: "may we add a TopMark header here?"
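
The validation rules listed for `filenames` can be sketched as a single normalizing function. This is an illustration of the documented constraints only: `normalize_filename_rule` is a hypothetical helper, and the order of checks and the error messages are assumptions, not TopMark's implementation.

```python
import re


def normalize_filename_rule(rule: str) -> str:
    """Hypothetical helper: return the canonical POSIX-style form of a rule."""
    if not rule:
        raise ValueError("filename rule must not be empty")
    if rule.startswith("\\\\") or rule.startswith("//"):
        raise ValueError(f"UNC-like rule is not allowed: {rule!r}")
    if re.match(r"^[A-Za-z]:", rule):
        raise ValueError(f"Windows drive path is not allowed: {rule!r}")

    # Backslashes are accepted as input and normalized to "/".
    canonical = rule.replace("\\", "/")
    if canonical.startswith("/"):
        raise ValueError(f"absolute rule is not allowed: {rule!r}")

    # Reject empty, "." and ".." segments (this also covers trailing slashes).
    for segment in canonical.split("/"):
        if segment in ("", ".", ".."):
            raise ValueError(f"invalid segment {segment!r} in rule: {rule!r}")
    return canonical


print(normalize_filename_rule(".vscode\\settings.json"))
```

Normalizing once at construction means the same string can be used for matching and for serialization, regardless of the platform on which the rule was written.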

Content-based recognition (example)¶

A practical use case is differentiating commented JSON (CJSON) from plain JSON. A file named config.json might actually be CJSON, which permits comments and can therefore carry a TopMark header. You can provide a CJSON file type with a content_matcher that inspects the file for comment tokens (e.g. // or /* ... */) while avoiding naïve false positives:

from pathlib import Path


def looks_like_cjson(path: Path) -> bool:
    try:
        text = path.read_text(encoding="utf-8", errors="ignore")
    except OSError:
        return False
    # Heuristic: treat any // or /* token as a comment marker. This is a
    # simple check; it does not exclude tokens inside string literals.
    return "//" in text or "/*" in text

Registering such a file type makes TopMark recognize these files; pairing it with a suitable header processor makes them supported for processing.
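
A plain substring test also fires on string values such as "http://example.com" in ordinary JSON. One way to reduce such false positives is to scan for comment tokens only outside string literals, as in the sketch below. The function name is hypothetical and the scanner handles only double-quoted JSON strings with backslash escapes; it is an illustration, not the matcher TopMark ships.

```python
def has_comment_outside_strings(text: str) -> bool:
    """Return True if `//` or `/*` occurs outside a JSON string literal."""
    in_string = False
    escaped = False
    for index, char in enumerate(text):
        if in_string:
            if escaped:
                # The previous character was a backslash; skip this one.
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
            continue
        if char == '"':
            in_string = True
        elif char == "/" and text[index + 1 : index + 2] in ("/", "*"):
            return True
    return False


print(has_comment_outside_strings('{"url": "http://example.com"}'))
print(has_comment_outside_strings('// settings\n{"a": 1}'))
```

A content matcher could read a bounded prefix of the file and pass it to this function, which keeps the check fast on large trees.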

Notes

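matches first tries extensions, filenames, and regex patterns. If content_matcher is set and content_gate permits probing, the matcher's result then decides the outcome; otherwise the name-rule result is returned. With the default gate, ContentGate.NEVER, the matcher is never called.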
filenames entries are normalized and validated during construction. Tail-subpath rules are matched against path.as_posix(), but the rules themselves are stored canonically and should be serialized as POSIX-style strings.
Content matchers should read a small portion of the file where possible to remain fast on large trees. The current implementation leaves that policy to the callable to keep the base class simple.

qualified_key `property` ¶

qualified_key

Return the qualified identity key for this file type instance.

Format: "<namespace>:<local_key>".
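
The format can be illustrated with a minimal stand-in. `FileTypeIdentity` is a hypothetical class holding only the two identity fields, and the `"example"` namespace is made up for the demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileTypeIdentity:
    """Hypothetical stand-in holding only the two identity fields."""

    namespace: str
    local_key: str

    @property
    def qualified_key(self) -> str:
        # Join namespace and local key with a colon, as in the documented format.
        return f"{self.namespace}:{self.local_key}"


print(FileTypeIdentity(namespace="example", local_key="python").qualified_key)
```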

matches ¶

matches(path)

Determine whether this file type matches a path.

Matching is attempted in deterministic order: extensions, filename rules, regex patterns, and then an optional content matcher if permitted by content_gate. Exact basename filename rules compare against path.name; relative tail-subpath filename rules compare against path.as_posix().

Parameters:

Name	Type	Description	Default
`path`	`Path`	Path to test against this file type.	required

Returns:

Type	Description
`bool`	True if the path matches this file type, False otherwise. Content matcher exceptions are caught and treated as non-matches.
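
The name-rule part of this ordering can be tried out with a simplified, standalone sketch. `match_name_rules` is a hypothetical function that follows the documented order (extensions, filename rules, patterns) and returns which kind of rule matched; it uses PurePosixPath so the examples behave the same on every platform, and it omits content gating entirely.

```python
import re
from pathlib import PurePosixPath
from typing import Optional, Sequence


def match_name_rules(
    path: PurePosixPath,
    extensions: Sequence[str] = (),
    filenames: Sequence[str] = (),
    patterns: Sequence[str] = (),
) -> Optional[str]:
    """Return "extension", "filename", "pattern", or None (simplified sketch)."""
    name = path.name

    # 1) Extensions: multi-dot entries compare against the end of the basename,
    #    single-dot entries against the final suffix only.
    for ext in extensions:
        if ext.count(".") > 1:
            if name.endswith(ext):
                return "extension"
        elif path.suffix == ext:
            return "extension"

    # 2) Filename rules: tail-subpath rules contain "/", basename rules do not.
    posix = path.as_posix()
    for rule in filenames:
        if "/" in rule:
            if posix.endswith(rule):
                return "filename"
        elif name == rule:
            return "filename"

    # 3) Patterns: full match against the basename only.
    for pattern in patterns:
        if re.fullmatch(pattern, name):
            return "pattern"
    return None


print(match_name_rules(PurePosixPath("dist/pkg.tar.gz"), extensions=[".tar.gz"]))
print(match_name_rules(PurePosixPath("proj/.vscode/settings.json"),
                       filenames=[".vscode/settings.json"]))
```

In this sketch a tail-subpath rule is a plain string-suffix comparison, so a rule such as `.vscode/settings.json` would also match a path ending in `my.vscode/settings.json`.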

Source code in src/topmark/filetypes/model.py

def matches(self, path: Path) -> bool:
    """Determine whether this file type matches a path.

    Matching is attempted in deterministic order: extensions, filename
    rules, regex patterns, and then an optional content matcher if permitted
    by `content_gate`. Exact basename filename rules compare against
    `path.name`; relative tail-subpath filename rules compare against
    `path.as_posix()`.

    Args:
        path: Path to test against this file type.

    Returns:
        True if the path matches this file type, False otherwise. Content
        matcher exceptions are caught and treated as non-matches.
    """
    # Track which name rule (if any) matched; used for content gating.
    matched_by: str | None = None

    # 1) Try matching by file extension (if present)
    if self.extensions:
        suffix: str = path.suffix
        name: str = path.name
        for ext in self.extensions:
            if ext.count(".") > 1:
                # Multiple-dot suffix (e.g., `.tar.gz`)
                if name.endswith(ext):
                    matched_by = "extension"
                    break
            else:
                # Single-dot suffix
                if suffix == ext:
                    matched_by = "extension"
                    break

    # 2) if still not matched, try filenames
    if matched_by is None and self.filenames:
        # Filename rules support exact basename and POSIX-style tail-subpath matches:
        #    - "settings.json" matches only if basename == "settings.json".
        #    - ".vscode/settings.json" matches if path.as_posix() ends with that tail.
        # Tail-subpath filename rules are matched against normalized POSIX-style paths.
        basename: str = path.name
        posix: str = path.as_posix()
        for fname in self.filenames:
            if "/" in fname:  # Path separators are already normalized
                if posix.endswith(fname):
                    matched_by = "filename"
                    break
            else:
                if basename == fname:
                    matched_by = "filename"
                    break

    # 3) if still not matched, try patterns
    if matched_by is None and self.patterns:
        # Regex patterns against basename (cached)
        if self._compiled_patterns is None:
            try:
                self._compiled_patterns = [re.compile(p) for p in self.patterns]
            except re.error:
                self._compiled_patterns = []
        for regex in self._compiled_patterns:
            if regex.fullmatch(path.name):
                matched_by = "pattern"
                break

    # 4) if still not matched, try content matcher (if present)
    if self.content_matcher is None:
        # Shortcut if no content matcher is defined:
        #    - If no name rule matched: False
        #    - If any name rule matched: True
        return matched_by is not None

    # Evaluate whether the content matcher is *allowed* to run, based on the gate.
    gate: Final[ContentGate] = self.content_gate
    allow_by_gate: bool
    if gate is ContentGate.NEVER:
        allow_by_gate = False
    elif gate is ContentGate.IF_EXTENSION:
        allow_by_gate = matched_by == "extension"
    elif gate is ContentGate.IF_FILENAME:
        allow_by_gate = matched_by == "filename"
    elif gate is ContentGate.IF_PATTERN:
        allow_by_gate = matched_by == "pattern"
    elif gate is ContentGate.IF_ANY_NAME_RULE:
        allow_by_gate = matched_by is not None
    elif gate is ContentGate.IF_NONE:
        # Permit content probing only if *no* name rules exist for this type.
        allow_by_gate = not (self.extensions or self.filenames or self.patterns)
    elif gate is ContentGate.ALWAYS:
        allow_by_gate = True
    else:
        allow_by_gate = False  # safety default

    # If gate disallows probing, return the name-rule result.
    if not allow_by_gate:
        return matched_by is not None

    # Gate allows probing: consult the content matcher.
    try:
        return bool(self.content_matcher(path))
    except Exception:  # noqa: BLE001 - user-provided matcher must not crash detection
        return False
