topmark.processors.base

Header processor base module for TopMark's header processing pipeline.

This module defines the HeaderProcessor base class, which provides a framework for processing file headers in different file types. It includes logic for scanning, parsing, and rendering header fields according to comment styles and file extensions.

The module also supports associating processors with file types to enable flexible, extensible header processing in the TopMark pipeline.

Placement strategies

TopMark supports two complementary placement strategies:

  • Line-based insertion (default): processors return a line anchor from get_header_insertion_index(); pipeline steps use compute_insertion_anchor() as the façade to obtain that anchor.
  • Character-offset insertion (for positional formats like XML/HTML): processors return NO_LINE_ANCHOR from get_header_insertion_index() and implement get_header_insertion_char_offset() to compute a byte/character offset.

The pipeline first attempts text-based insertion when a char offset is provided; otherwise it falls back to the line-based strategy using the computed anchor.
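As a rough illustration of that decision order, the sketch below prefers a character offset when one is supplied and otherwise falls back to the line anchor. The function `insert_header`, the sentinel value `-1`, and the "leave text unchanged" fallback are assumptions made for this example; they are not taken from TopMark's pipeline code.

```python
from typing import Optional

# Hypothetical sentinel value; the real NO_LINE_ANCHOR constant is not shown on this page.
NO_LINE_ANCHOR = -1


def insert_header(
    text: str,
    header: str,
    *,
    line_anchor: int,
    char_offset: Optional[int] = None,
) -> str:
    """Insert ``header`` by character offset when given, else by line anchor."""
    if char_offset is not None:
        # Text-based path: splice the header at the character offset.
        return text[:char_offset] + header + text[char_offset:]
    if line_anchor == NO_LINE_ANCHOR:
        # Neither strategy produced a position (assumed behavior for this sketch).
        return text
    # Line-based path: insert the header before the line at ``line_anchor``.
    lines = text.splitlines(keepends=True)
    lines[line_anchor:line_anchor] = [header]
    return "".join(lines)
```

With this shape, an XML-like processor would pass an offset just past the prolog, while a shell or Python processor would pass only the line anchor.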

RuntimeConfigLike

Bases: Protocol

Minimal structural subset of FrozenConfig required by HeaderProcessor.

This protocol keeps topmark.processors.base independent from the full runtime FrozenConfig model and avoids import cycles. Only the fields actually consumed by render_header_lines() are included here.

header_fields property

header_fields

List of header fields from the [header] section.

align_fields property

align_fields

Whether to align fields, from [formatting].
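To show what typing against such a structural protocol looks like, here is a minimal sketch. `ConfigLike`, `StubConfig`, and `describe` are hypothetical names, and the property types (`Sequence[str]`, `bool`) are assumptions inferred from the descriptions above rather than the actual declarations.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class ConfigLike(Protocol):
    """Hypothetical stand-in for a protocol with the two documented properties."""

    @property
    def header_fields(self) -> Sequence[str]: ...

    @property
    def align_fields(self) -> bool: ...


@dataclass(frozen=True)
class StubConfig:
    """Any object with matching attributes satisfies the protocol structurally."""

    header_fields: tuple[str, ...] = ("file", "license")
    align_fields: bool = True


def describe(config: ConfigLike) -> str:
    # Only the two protocol members are touched, so no full config model is needed.
    return f"{len(config.header_fields)} fields, aligned={config.align_fields}"
```

Because the match is structural, `StubConfig` never imports or subclasses the protocol, which is what lets a module like this one avoid import cycles with the full config model.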

ProcessingContextLike

Bases: Protocol

Minimal structural subset of ProcessingContext required by HeaderProcessor.

This protocol keeps topmark.processors.base independent from the full pipeline context model and avoids import cycles. Only the views bundle and diagnostic sink methods needed by processor helpers are included here.

views property

views

Pipeline view bundle used by processor helpers.

diagnostics property

diagnostics

Mutable diagnostic sink used by processor helpers.

HeaderProcessor

HeaderProcessor(
    *,
    block_prefix=None,
    block_suffix=None,
    line_prefix=None,
    line_suffix=None,
    line_indent=None,
    header_indent=None,
)

Base class for header processors that handle specific file types.

A header processor knows how to find, render, and modify TopMark headers for one concrete topmark.filetypes.model.FileType. The registry binds a processor instance to a file type at runtime (proc.file_type = ft), and TopMark uses that pairing during scanning and updates.

Responsibilities
  • Scanning: Locate existing headers via start/end markers and comment affixes (see get_header_bounds, line_has_directive).
  • Parsing: Extract key→value pairs from the header payload (see parse_fields).
  • Rendering: Emit preamble/fields/postamble with proper comment syntax (see render_preamble_lines, render_header_lines, render_postamble_lines).
  • Placement policy: Determine insertion points; default is shebang-aware for languages like Python (see get_header_insertion_index).
  • Update/strip helpers: Prepare insertions and removals in a way that preserves surrounding whitespace (see prepare_header_for_insertion, strip_header_block).

What this class does not do
  • Content-based recognition. Deciding which file type a path belongs to is the role of topmark.filetypes.model.FileType via FileType.content_matcher. The processor assumes it is already associated with the correct file type.

Indentation semantics
  • header_indent: indentation before the line prefix (used to preserve existing indentation when replacing nested/indented headers).
  • line_indent: indentation after the line prefix (applied to the header field lines).
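The difference between the two indents is easiest to see on a single rendered line. The helper below is a hypothetical simplification: it assumes the pieces are simply concatenated in the order indent, prefix, indent, content, suffix, which may differ from what the private line-wrapping helper actually does.

```python
def compose_line(
    content: str,
    *,
    header_indent: str = "",
    line_prefix: str = "",
    line_indent: str = "",
    line_suffix: str = "",
) -> str:
    """Concatenate the parts of one header line (assumed ordering, for illustration)."""
    # header_indent sits before the comment prefix; line_indent sits after it.
    return f"{header_indent}{line_prefix}{line_indent}{content}{line_suffix}"


# A field line inside an indented JSONC block: four spaces before "//", one after.
example = compose_line(
    "file: settings.jsonc",
    header_indent="    ",
    line_prefix="//",
    line_indent=" ",
)
```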

Extension points

Subclasses typically set comment delimiters (line_prefix, line_suffix, block_prefix, block_suffix) and may override any of the hooks documented below to support format-specific behavior (e.g., XML prolog placement or Markdown fences).

Placement strategies
  • Line-based (default): override get_header_insertion_index() if needed. Pipeline steps call compute_insertion_anchor() as a stable façade.
  • Character-offset (XML/HTML-like): return NO_LINE_ANCHOR from get_header_insertion_index() and implement get_header_insertion_char_offset(); the pipeline will prefer this path.

Public API note

In the stable public surface, consider typing against a minimal protocol rather than this concrete base if you are authoring plugins. The registry binds processors to file types and exposes read-only metadata for common integrations.

Parameters:

Name Type Description Default
block_prefix str | None

The prefix string for block-style header start.

None
block_suffix str | None

The suffix string for block-style header end.

None
line_prefix str | None

The prefix string for each line within the header block.

None
line_suffix str | None

The suffix string for each line within the header block.

None
line_indent str | None

The indentation applied to header field lines after the comment prefix (e.g., spaces after //).

None
header_indent str | None

The indentation applied before the comment prefix; used to preserve existing leading indentation when replacing an indented header block inside a document (e.g., nested JSONC).

None

Attributes:

Name Type Description
namespace str

Processor namespace class metadata.

local_key str

Unique processor identity class metadata within its namespace.

description str

Human-readable processor description class metadata.

file_type FileType | None

The FileType bound to this processor instance by the registry.

block_prefix str

The prefix string for block-style header start.

block_suffix str

The suffix string for block-style header end.

line_prefix str

The prefix string for each line within the header block.

line_suffix str

The suffix string for each line within the header block.

line_indent str

The indentation applied to header field lines after the comment prefix (e.g., spaces after //).

header_indent str

The indentation applied before the comment prefix; used to preserve existing leading indentation when replacing an indented header block inside a document (e.g., nested JSONC).

Source code in src/topmark/processors/base.py
def __init__(
    self,
    *,
    block_prefix: str | None = None,
    block_suffix: str | None = None,
    line_prefix: str | None = None,
    line_suffix: str | None = None,
    line_indent: str | None = None,
    header_indent: str | None = None,
) -> None:
    self.file_type = None

    if block_prefix is not None:
        self.block_prefix = block_prefix
    if block_suffix is not None:
        self.block_suffix = block_suffix
    if line_prefix is not None:
        self.line_prefix = line_prefix
    if line_suffix is not None:
        self.line_suffix = line_suffix
    if line_indent is not None:
        self.line_indent = line_indent
    if header_indent is not None:
        self.header_indent = header_indent

    # Cache for per-policy encoding regex to avoid recompilation
    self._encoding_pattern: re.Pattern[str] | None = None
    self._encoding_pattern_src: str | None = None

qualified_key property

qualified_key

Return the qualified identity key for this processor.

Format: "<namespace>:<local_key>".

parse_fields

parse_fields(context)

Parse key-value pairs from the detected header block (view-based).

This implementation expects the scanner to have populated context.header with an outer slice (markers included). It searches within context.header.lines for the first START marker and the next END marker, then parses only the payload lines between them.

Parameters:

Name Type Description Default
context ProcessingContextLike

Pipeline context where header has been set to a topmark.pipeline.views.HeaderView (range/lines/block/mapping).

required

Returns:

Type Description
HeaderParseResult

Parsed mapping and per-line success/error counters.

Notes
  • Comment affixes (line_prefix / line_suffix) are stripped per line.
  • Malformed field lines add diagnostics but do not mutate context.status.header (handled by the scanner).
  • Subclasses may override to support multi-line fields or alternate syntax.
Source code in src/topmark/processors/base.py
def parse_fields(self, context: ProcessingContextLike) -> HeaderParseResult:
    """Parse key-value pairs from the detected header block (*view-based*).

    This implementation expects the scanner to have populated
    `context.header` with an outer slice (markers included). It searches
    within ``context.header.lines`` for the first START marker and the next END
    marker, then parses only the payload lines between them.

    Args:
        context: Pipeline context where ``header`` has been set to a
            [`topmark.pipeline.views.HeaderView`][] (range/lines/block/mapping).

    Returns:
        Parsed mapping and per-line success/error counters.

    Notes:
        - Comment affixes (``line_prefix`` / ``line_suffix``) are stripped per line.
        - Malformed field lines add diagnostics but do not mutate ``context.status.header``
          (handled by the scanner).
        - Subclasses may override to support multi-line fields or alternate syntax.
    """
    # Keep track of processed header entries (lines)
    cnt_header_ok: int = 0
    cnt_header_error: int = 0

    empty_result = HeaderParseResult()

    hv: HeaderView | None = context.views.header
    if hv is None or hv.range is None or hv.lines is None:
        return HeaderParseResult(
            fields={}, success_count=cnt_header_ok, error_count=cnt_header_error
        )

    # Operate on the header lines as provided by the scanner (outer slice).
    lines: list[str] = list(hv.lines)
    if not lines:
        return empty_result

    # 1) Locate START and END markers *within* the provided slice.
    start_rel: int | None
    end_rel: int | None
    start_rel, end_rel = self._find_inner_marker_indices(lines)
    if start_rel is None or end_rel is None or end_rel <= start_rel:
        # Keep scanner as the single authority for MALFORMED; just surface a diagnostic.
        context.diagnostics.add_error(
            "parse_fields(): could not locate a valid START/END marker pair."
        )
        return empty_result

    # 2) Extract payload (strictly between markers).
    payload: list[str] = lines[start_rel + 1 : end_rel]
    if not payload:
        return empty_result

    # 3) Parse lines as `key: value`, stripping comment affixes and whitespace.
    header_mapping: dict[str, str] = {}
    # Compute approximate absolute line number for diagnostics if we can.
    abs_start: int
    _abs_end: int
    abs_start, _abs_end = hv.range

    for i, raw in enumerate(payload, start=1):
        # Absolute line number in the original file (1-based)
        abs_line_no: int = abs_start + start_rel + i + 1
        logger.trace("Header line %d: [%s]", abs_line_no, raw)

        cleaned: str = self._strip_line_affixes(raw).strip()
        if not cleaned:
            continue

        if ":" not in cleaned:
            # Header line has no colon
            context.diagnostics.add_error(
                f"Malformed header at line {abs_line_no} (no colon found): {raw!r}"
            )
            cnt_header_error += 1
            continue

        key: str
        value: str
        key, value = cleaned.split(":", 1)
        k: str = key.strip()
        v: str = value.strip()
        if not k:
            # Header line has colon but empty text before colon
            context.diagnostics.add_error(
                f"Malformed header at line {abs_line_no} (empty text before colon): {raw!r}"
            )
            cnt_header_error += 1
            continue

        header_mapping[k] = v
        cnt_header_ok += 1

    return HeaderParseResult(
        fields=header_mapping,
        success_count=cnt_header_ok,
        error_count=cnt_header_error,
    )
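The per-line rule used above (strip affixes, split on the first colon, reject lines with no colon or an empty key) can be tried in isolation. The function below is a standalone sketch of that rule only; `parse_payload` is a hypothetical name, it returns a plain tuple instead of a HeaderParseResult, and it omits marker lookup and diagnostics.

```python
def parse_payload(
    payload: list[str],
    *,
    line_prefix: str = "",
    line_suffix: str = "",
) -> tuple[dict[str, str], int, int]:
    """Parse ``key: value`` lines; return (fields, ok_count, error_count)."""
    fields: dict[str, str] = {}
    ok = errors = 0
    for raw in payload:
        cleaned = raw.strip()
        if line_prefix:
            cleaned = cleaned.removeprefix(line_prefix)
        if line_suffix:
            cleaned = cleaned.removesuffix(line_suffix)
        cleaned = cleaned.strip()
        if not cleaned:
            continue  # blank spacer lines are skipped, not counted
        # Split on the first colon only, so values may contain colons (e.g. URLs).
        key, sep, value = cleaned.partition(":")
        if not sep or not key.strip():
            errors += 1
            continue
        fields[key.strip()] = value.strip()
        ok += 1
    return fields, ok, errors


sample = [
    "# file: demo.py\n",
    "#\n",
    "# url: https://example.com/x\n",
    "# no colon here\n",
    "# : orphan\n",
]
parsed = parse_payload(sample, line_prefix="#")
```

Here the first and third lines parse, the bare `#` line is ignored, and the last two lines count as errors.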

render_preamble_lines

render_preamble_lines(
    *,
    newline_style,
    block_prefix=None,
    line_prefix=None,
    line_suffix=None,
    header_indent="",
)

Render the TopMark preamble lines for the current processor.

The preamble consists of:

  1) the block comment opener (when configured),
  2) the TOPMARK_START_MARKER directive line, and
  3) an intentional blank line following the start marker.

Parameters:

Name Type Description Default
newline_style str

Newline characters to append to each rendered line.

required
block_prefix str | None

Optional override for the block prefix; defaults to the instance's block_prefix when None.

None
line_prefix str | None

Optional override for the line prefix; defaults to the instance's line_prefix when None.

None
line_suffix str | None

Optional override for the line suffix; defaults to the instance's line_suffix when None.

None
header_indent str

The indentation applied before the comment prefix; used to preserve existing leading indentation when replacing an indented header block inside a document (e.g., nested JSONC).

''

Returns:

Type Description
list[str]

Preamble lines (each ending with newline_style) that precede the header fields.

Source code in src/topmark/processors/base.py
def render_preamble_lines(
    self,
    *,
    newline_style: str,
    block_prefix: str | None = None,
    line_prefix: str | None = None,
    line_suffix: str | None = None,
    header_indent: str = "",
) -> list[str]:
    """Render the TopMark preamble lines for the current processor.

    The preamble consists of:
      1) the block comment opener (when configured),
      2) the ``TOPMARK_START_MARKER`` directive line, and
      3) an intentional blank line following the start marker.

    Args:
        newline_style: Newline characters to append to each rendered line.
        block_prefix: Optional override for the block prefix; defaults to
            the instance's ``block_prefix`` when ``None``.
        line_prefix: Optional override for the line prefix; defaults to
            the instance's ``line_prefix`` when ``None``.
        line_suffix: Optional override for the line suffix; defaults to
            the instance's ``line_suffix`` when ``None``.
        header_indent: The indentation applied *before* the comment prefix; used
            to preserve existing leading indentation when replacing an indented
            header block inside a document (e.g., nested JSONC).

    Returns:
        Preamble lines (each ending with ``newline_style``) that precede the header fields.
    """
    bp: str = self.block_prefix if block_prefix is None else block_prefix
    lines: list[str] = []
    if bp:
        lines.append(header_indent + bp + newline_style)
    lines.append(
        self._wrap_line(
            TOPMARK_START_MARKER,
            newline_style=newline_style,
            line_prefix=line_prefix,
            line_suffix=line_suffix,
            header_indent=header_indent,
            after_prefix_indent="",
        )
    )
    # Empty line after start marker
    lines.append(
        self._wrap_line(
            "",
            newline_style=newline_style,
            line_prefix=line_prefix,
            line_suffix=line_suffix,
            header_indent=header_indent,
            after_prefix_indent="",
        )
    )
    return lines

render_postamble_lines

render_postamble_lines(
    *,
    newline_style,
    block_suffix=None,
    line_prefix=None,
    line_suffix=None,
    header_indent="",
)

Render the TopMark postamble lines for the current processor.

The postamble consists of:

  1) an intentional blank line before the end marker,
  2) the TOPMARK_END_MARKER directive line, and
  3) the block comment closer (when configured).

Parameters:

Name Type Description Default
newline_style str

Newline characters to append to each rendered line.

required
block_suffix str | None

Optional override for the block suffix; defaults to the instance's block_suffix when None.

None
line_prefix str | None

Optional override for the line prefix; defaults to the instance's line_prefix when None.

None
line_suffix str | None

Optional override for the line suffix; defaults to the instance's line_suffix when None.

None
header_indent str

The indentation applied before the comment prefix; used to preserve existing leading indentation when replacing an indented header block inside a document (e.g., nested JSONC).

''

Returns:

Type Description
list[str]

Postamble lines (each ending with newline_style) that follow the header fields.

Source code in src/topmark/processors/base.py
def render_postamble_lines(
    self,
    *,
    newline_style: str,
    block_suffix: str | None = None,
    line_prefix: str | None = None,
    line_suffix: str | None = None,
    header_indent: str = "",
) -> list[str]:
    """Render the TopMark postamble lines for the current processor.

    The postamble consists of:
      1) an intentional blank line before the end marker,
      2) the ``TOPMARK_END_MARKER`` directive line, and
      3) the block comment closer (when configured).

    Args:
        newline_style: Newline characters to append to each rendered line.
        block_suffix: Optional override for the block suffix; defaults to
            the instance's ``block_suffix`` when ``None``.
        line_prefix: Optional override for the line prefix; defaults to
            the instance's ``line_prefix`` when ``None``.
        line_suffix: Optional override for the line suffix; defaults to
            the instance's ``line_suffix`` when ``None``.
        header_indent: The indentation applied *before* the comment prefix; used
            to preserve existing leading indentation when replacing an indented
            header block inside a document (e.g., nested JSONC).

    Returns:
        Postamble lines (each ending with ``newline_style``) that follow the header fields.
    """
    bs: str = self.block_suffix if block_suffix is None else block_suffix
    lines: list[str] = []
    # Empty line before end marker
    lines.append(
        self._wrap_line(
            "",
            newline_style=newline_style,
            line_prefix=line_prefix,
            line_suffix=line_suffix,
            header_indent=header_indent,
            after_prefix_indent="",
        )
    )
    lines.append(
        self._wrap_line(
            TOPMARK_END_MARKER,
            newline_style=newline_style,
            line_prefix=line_prefix,
            line_suffix=line_suffix,
            header_indent=header_indent,
            after_prefix_indent="",
        )
    )
    if bs:
        lines.append(bs + newline_style)
    return lines

render_header_lines

render_header_lines(
    header_values,
    config,
    newline_style,
    block_prefix_override=None,
    block_suffix_override=None,
    line_prefix_override=None,
    line_suffix_override=None,
    line_indent_override=None,
    header_indent_override=None,
)

Render a header block from configuration, template, and overrides.

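This method renders one line per entry in config.header_fields, taking each value from header_values (a field missing from header_values is rendered with an empty value). Field names are padded to a common width when config.align_fields is true. The field lines are wrapped between the preamble and postamble, and any override argument that is not None replaces the corresponding processor attribute.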

Parameters:

Name Type Description Default
header_values Mapping[str, str]

Mapping of header fields to render.

required
config RuntimeConfigLike

TopMark configuration (defines header fields and options).

required
newline_style str

Newline style (LF, CR, CRLF).

required
block_prefix_override str | None

Optional block prefix override.

None
block_suffix_override str | None

Optional block suffix override.

None
line_prefix_override str | None

Optional line prefix override.

None
line_suffix_override str | None

Optional line suffix override.

None
line_indent_override str | None

Optional indentation override after the comment prefix, applied to header field lines (defaults to the processor's line_indent).

None
header_indent_override str | None

Optional indentation override before the comment prefix, applied to complete header lines (used to preserve existing leading indentation on replace).

None

Returns:

Type Description
list[str]

Rendered header lines ending with newline_style.

Source code in src/topmark/processors/base.py
def render_header_lines(
    self,
    header_values: Mapping[str, str],
    config: RuntimeConfigLike,
    newline_style: str,
    block_prefix_override: str | None = None,
    block_suffix_override: str | None = None,
    line_prefix_override: str | None = None,
    line_suffix_override: str | None = None,
    line_indent_override: str | None = None,
    header_indent_override: str | None = None,
) -> list[str]:
    """Render a header block from configuration, template, and overrides.

    This method renders one line per entry in ``config.header_fields``, taking each
    value from ``header_values`` (a missing field renders with an empty value). Field
    names are padded to a common width when ``config.align_fields`` is true.

    Args:
        header_values: Mapping of header fields to render.
        config: TopMark configuration (defines header fields and options).
        newline_style: Newline style (``LF``, ``CR``, ``CRLF``).
        block_prefix_override: Optional block prefix override.
        block_suffix_override: Optional block suffix override.
        line_prefix_override: Optional line prefix override.
        line_suffix_override: Optional line suffix override.
        line_indent_override: Optional indentation override *after*
            the comment prefix, applied to header field lines (defaults to the
            processor's `line_indent`).
        header_indent_override: Optional indentation override *before*
            the comment prefix, applied to complete header lines (used to preserve
            existing leading indentation on replace).

    Returns:
        Rendered header lines ending with ``newline_style``.
    """
    logger.info(
        "%s: rendering header fields: %s",
        self.__class__.__name__,
        ", ".join(config.header_fields),
    )
    logger.debug("render_header_lines: align_fields=%s", config.align_fields)

    # Use provided overrides or defaults from the instance
    block_prefix = (
        block_prefix_override if block_prefix_override is not None else self.block_prefix
    )
    block_suffix = (
        block_suffix_override if block_suffix_override is not None else self.block_suffix
    )
    line_prefix = line_prefix_override if line_prefix_override is not None else self.line_prefix
    line_suffix = line_suffix_override if line_suffix_override is not None else self.line_suffix
    effective_line_indent = (
        line_indent_override if line_indent_override is not None else self.line_indent
    )
    header_indent = (
        header_indent_override if header_indent_override is not None else self.header_indent
    )

    # Compute header field name width only when alignment is enabled.
    # When align_fields is False, emit compact "field: value" without padding.
    if config.align_fields and header_values:
        width: int = max(len(k) for k in header_values) + 1
    else:
        width = 0

    # Build the header lines
    lines: list[str] = []

    # Compose preamble
    lines.extend(
        self.render_preamble_lines(
            newline_style=newline_style,
            block_prefix=block_prefix,
            line_prefix=line_prefix,
            line_suffix=line_suffix,
            header_indent=header_indent,
        )
    )

    # Field lines (no blanks in-between)
    for field in config.header_fields:
        value: str = header_values.get(field, "")
        inner: str = f"{field:<{width}}: {value}" if width else f"{field}: {value}"
        lines.append(
            self._wrap_line(
                inner,
                newline_style=newline_style,
                line_prefix=line_prefix,
                line_suffix=line_suffix,
                header_indent=header_indent,
                after_prefix_indent=effective_line_indent,
            )
        )

    # Compose postamble
    lines.extend(
        self.render_postamble_lines(
            newline_style=newline_style,
            block_suffix=block_suffix,
            line_prefix=line_prefix,
            line_suffix=line_suffix,
            header_indent=header_indent,
        )
    )

    logger.debug("Rendered %d header lines:\n%s", len(lines), "".join(lines))

    return lines
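The alignment rule in the listing above is compact enough to try on its own. The sketch below reproduces only the width computation and the field formatting; `format_fields` is a hypothetical helper, and comment prefixes, markers, and newlines are deliberately left out.

```python
from typing import Mapping, Sequence


def format_fields(
    header_values: Mapping[str, str],
    header_fields: Sequence[str],
    *,
    align_fields: bool,
) -> list[str]:
    """Format ``field: value`` strings, padding names when alignment is on."""
    # Width is the longest key in header_values plus one space before the colon.
    if align_fields and header_values:
        width = max(len(k) for k in header_values) + 1
    else:
        width = 0
    out: list[str] = []
    for field in header_fields:
        value = header_values.get(field, "")  # missing fields render empty
        out.append(f"{field:<{width}}: {value}" if width else f"{field}: {value}")
    return out


values = {"file": "base.py", "license": "MIT"}
fields = ["file", "license", "project"]
aligned = format_fields(values, fields, align_fields=True)
compact = format_fields(values, fields, align_fields=False)
```

Note that the width comes from the keys of `header_values`, not from `header_fields`, so a configured field that is longer than every supplied key is emitted without extra padding.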

compute_insertion_anchor

compute_insertion_anchor(lines)

Return a stable line-based insertion anchor for the pipeline.

This small facade exists so pipeline steps have a single, stable entry point for line-based placement. By default, it simply delegates to get_header_insertion_index.

Processors that insert by character offset (e.g., XML/HTML) should override get_header_insertion_index to return NO_LINE_ANCHOR, which this method will propagate unchanged.

Parameters:

Name Type Description Default
lines list[str]

Full file content split into lines.

required

Returns:

Type Description
int

A 0-based line index where a header would be inserted, or NO_LINE_ANCHOR when line-based anchoring is not used.

Source code in src/topmark/processors/base.py
def compute_insertion_anchor(self, lines: list[str]) -> int:
    """Return a stable line-based insertion anchor for the pipeline.

    This small facade exists so pipeline steps have a single, stable
    entry point for *line-based* placement. By default, it simply
    delegates to `get_header_insertion_index`.

    Processors that insert by **character offset** (e.g., XML/HTML) should
    override `get_header_insertion_index` to return
    `NO_LINE_ANCHOR`, which this method will propagate unchanged.

    Args:
        lines: Full file content split into lines.

    Returns:
        A 0-based line index where a header would be inserted, or `NO_LINE_ANCHOR` when
        line-based anchoring is not used.
    """
    return self.get_header_insertion_index(lines)

get_header_insertion_index

get_header_insertion_index(file_lines)

Determine where to insert the header based on file type policy.

Default behavior is shebang-aware:
  • If the file type policy declares supports_shebang=True and the first line starts with #!, insert the header after the shebang (and optional encoding line when encoding_line_regex is provided).
  • Otherwise, insert at the top of file (index 0).

If inserting after a preamble and the next line is already blank, consume exactly one existing blank line so that a single blank separates the preamble from the header.

Subclasses may override this when a format imposes different placement rules.

Parameters:

Name Type Description Default
file_lines list[str]

Lines from the file being processed.

required

Returns:

Type Description
int

Index at which to insert the TopMark header, or NO_LINE_ANCHOR if no insertion index can be found.

Source code in src/topmark/processors/base.py
def get_header_insertion_index(self, file_lines: list[str]) -> int:
    """Determine where to insert the header based on file type policy.

    Default behavior is *shebang-aware*:
      - If the file type policy declares ``supports_shebang=True`` and the first line
        starts with ``#!``, insert the header *after* the shebang (and optional encoding
        line when ``encoding_line_regex`` is provided).
      - Otherwise, insert at the top of file (index 0).

    If inserting after a preamble and the next line is already blank, consume exactly
    one existing blank line so that a single blank separates the preamble from the header.

    Subclasses may override this when a format imposes different placement rules.

    Args:
        file_lines: Lines from the file being processed.

    Returns:
        Index at which to insert the TopMark header, or ``NO_LINE_ANCHOR`` if no insertion
        index can be found.
    """
    index = 0
    shebang_present = False

    # Shebang handling based on per-file-type policy
    policy: FileTypeHeaderPolicy | None = (
        self.file_type.header_policy if self.file_type else None
    )
    if policy and policy.supports_shebang and file_lines and file_lines[0].startswith("#!"):
        shebang_present = True
        index = 1

        # Optional encoding line immediately after shebang (e.g., Python)
        if policy.encoding_line_regex and len(file_lines) > index:
            src = policy.encoding_line_regex
            # Compile on first use or when the pattern string changes
            if self._encoding_pattern is None or self._encoding_pattern_src != src:
                self._encoding_pattern = re.compile(src)
                self._encoding_pattern_src = src
            if self._encoding_pattern.search(file_lines[index]):
                index += 1

    # If a shebang block exists and the next line is a *policy-blank*, consume exactly one.
    # This keeps a single spacer between the preamble and the header without eating content
    # under STRICT (e.g., form-feed \x0c is preserved).
    if (
        shebang_present
        and index < len(file_lines)
        and is_pure_spacer(file_lines[index], policy)
    ):
        index += 1

    return index
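To see the index this policy yields for a few inputs, here is a standalone approximation. `insertion_index` is a hypothetical function: it takes the two policy settings as keyword arguments and treats any whitespace-only line as blank, whereas the listing above delegates that decision to `is_pure_spacer` and the file type policy.

```python
import re
from typing import Optional


def insertion_index(
    file_lines: list[str],
    *,
    supports_shebang: bool = True,
    encoding_line_regex: Optional[str] = None,
) -> int:
    """Return the 0-based line index at which a header would be inserted."""
    index = 0
    if supports_shebang and file_lines and file_lines[0].startswith("#!"):
        index = 1
        # Optional encoding line directly after the shebang.
        if encoding_line_regex and len(file_lines) > index:
            if re.search(encoding_line_regex, file_lines[index]):
                index += 1
        # Consume exactly one blank line after the shebang block
        # (simplified: whitespace-only counts as blank in this sketch).
        if index < len(file_lines) and not file_lines[index].strip():
            index += 1
    return index


# Example pattern for a Python-style encoding declaration (an assumption for this demo).
ENCODING = r"coding[:=]\s*([-\w.]+)"

script = ["#!/usr/bin/env python3\n", "# -*- coding: utf-8 -*-\n", "\n", "import os\n"]
anchor = insertion_index(script, encoding_line_regex=ENCODING)
```

For `script`, the shebang, the encoding line, and one blank line are all skipped, so the header would go immediately before `import os`.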

line_has_directive

line_has_directive(line, directive)

Check whether a line contains the directive with the expected affixes.

This method is used by get_header_bounds() to locate header start/end markers. Subclasses may override this method for more flexible or format-specific matching.

Parameters:

Name Type Description Default
line str

The line of text to check (whitespace is trimmed internally).

required
directive str

The directive string to look for.

required

Returns:

Type Description
bool

True if the line contains the directive with the configured prefix/suffix, otherwise False.

Source code in src/topmark/processors/base.py
def line_has_directive(self, line: str, directive: str) -> bool:
    """Check whether a line contains the directive with the expected affixes.

    This method is used by ``get_header_bounds()`` to locate header start/end markers.
    Subclasses may override this method for more flexible or format-specific matching.

    Args:
        line: The line of text to check (whitespace is trimmed internally).
        directive: The directive string to look for.

    Returns:
        ``True`` if the line contains the directive with the configured prefix/suffix,
        otherwise ``False``.
    """
    # This method matches directives with configured affixes; policy-based blank
    # collapsing does not apply here. Normalize incidental whitespace for affix matching.
    line = line.strip()

    # Step 1: Check for the presence of the defined prefix
    if self.line_prefix and not line.startswith(self.line_prefix):
        return False

    # Step 2: Check for the presence of the defined suffix
    if self.line_suffix and not line.endswith(self.line_suffix):
        return False

    # Step 3: Remove the prefix and suffix and check the remaining content
    candidate: str = line
    if self.line_prefix:
        candidate = candidate.removeprefix(self.line_prefix)
    if self.line_suffix:
        candidate = candidate.removesuffix(self.line_suffix)

    # Step 4: Strip whitespace after removing affixes to match the directive exactly.
    candidate = candidate.strip()

    return candidate == directive
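The matching steps above can be exercised without a processor instance. The sketch below mirrors them as a free function; `has_directive` and the marker text `example:start` are placeholders chosen for this example, not TopMark's actual marker strings.

```python
def has_directive(
    line: str,
    directive: str,
    *,
    line_prefix: str = "",
    line_suffix: str = "",
) -> bool:
    """Return True when ``line`` is exactly the directive wrapped in the given affixes."""
    line = line.strip()
    if line_prefix and not line.startswith(line_prefix):
        return False
    if line_suffix and not line.endswith(line_suffix):
        return False
    candidate = line
    if line_prefix:
        candidate = candidate.removeprefix(line_prefix)
    if line_suffix:
        candidate = candidate.removesuffix(line_suffix)
    # The remaining text must equal the directive exactly; trailing words fail.
    return candidate.strip() == directive


MARKER = "example:start"  # placeholder marker text

hash_style = has_directive("  # example:start  \n", MARKER, line_prefix="#")
html_style = has_directive(
    "<!-- example:start -->", MARKER, line_prefix="<!--", line_suffix="-->"
)
trailing_text = has_directive("# example:start extra", MARKER, line_prefix="#")
```

The match is exact after affix removal, so surrounding whitespace is tolerated but extra text on the marker line is not.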

validate_header_location

validate_header_location(
    lines, *, header_start_idx, header_end_idx, anchor_idx
)

Validate that a detected header is at an acceptable location.

The default policy accepts a candidate header only when its start line is exactly at the computed anchor or within a small proximity window around it. Subclasses may override this to enforce format-specific constraints.

Parameters:

  • lines (list[str], required): Full file content split into lines.
  • header_start_idx (int, required): 0-based index of the candidate header's first line.
  • header_end_idx (int, required): 0-based index of the candidate header's last line (inclusive).
  • anchor_idx (int, required): 0-based index where a header would be inserted per policy.

Returns:

bool: True if the candidate lies within the configured proximity window, otherwise False.

Notes

The proximity window can be tuned per file type by defining scan_window_before and scan_window_after on the associated FileType. Defaults are 0 and 2, respectively.

Source code in src/topmark/processors/base.py
def validate_header_location(
    self,
    lines: list[str],
    *,
    header_start_idx: int,
    header_end_idx: int,
    anchor_idx: int,
) -> bool:
    """Validate that a detected header is at an acceptable location.

    The default policy accepts a candidate header only when its *start* line is
    exactly at the computed anchor or within a small proximity window around it.
    Subclasses may override this to enforce format-specific constraints.

    Args:
        lines: Full file content split into lines.
        header_start_idx: 0-based index of the candidate header's first line.
        header_end_idx: 0-based index of the candidate header's last line (inclusive).
        anchor_idx: 0-based index where a header would be inserted per policy.

    Returns:
        ``True`` if the candidate lies within the configured proximity window,
        otherwise ``False``.

    Notes:
        The proximity window can be tuned per file type by defining
        ``scan_window_before`` and ``scan_window_after`` on the associated
        ``FileType``. Defaults are 0 and 2, respectively.
    """
    # Per-file-type tunables (fallback to conservative defaults)
    before = 0
    after = 2
    if self.file_type is not None:
        # TODO: add both properties to the FileType dataclass
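        # Use explicit None checks so that a configured value of 0 is respected.
        before_cfg = getattr(self.file_type, "scan_window_before", None)
        after_cfg = getattr(self.file_type, "scan_window_after", None)
        before = before if before_cfg is None else int(before_cfg)
        after = after if after_cfg is None else int(after_cfg)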

    return (anchor_idx - before) <= header_start_idx <= (anchor_idx + after)
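
As a worked example of the proximity window, the following sketch evaluates the same inequality for a hypothetical anchor at line 1 (for instance, directly after a shebang) with the default window of 0 lines before and 2 lines after. The function name within_window is an assumption used only for this illustration.

```python
def within_window(header_start_idx: int, anchor_idx: int, before: int = 0, after: int = 2) -> bool:
    """Return True when the header start lies in [anchor - before, anchor + after]."""
    return (anchor_idx - before) <= header_start_idx <= (anchor_idx + after)


# With anchor_idx=1 and the defaults, the accepted start lines are 1, 2 and 3.
results = {start: within_window(start, anchor_idx=1) for start in range(5)}

# Widening the window backwards by one line also accepts a header at line 0.
widened = within_window(0, anchor_idx=1, before=1)
```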

get_header_bounds

get_header_bounds(*, lines, newline_style)

Locate the TopMark header and return its bounds as a HeaderBounds result (start inclusive, end exclusive).

This method first performs a marker preflight to catch malformed shapes (e.g., lone :end, lone :start, multiple or reversed markers). It then applies format-aware detection and proximity validation to return a valid span when present.

Parameters:

  • lines (Iterable[str], required): Logical file lines (keepends=True). The iterable may be list-backed or lazy (e.g., a generator).
  • newline_style (str, required): Dominant newline style (LF, CR, CRLF); used only to translate a character offset into a line index when the processor returns NO_LINE_ANCHOR.

Returns:

HeaderBounds: A discriminated result:

  • BoundsKind.SPAN with start (inclusive) and end (exclusive) when a valid header can be used.
  • BoundsKind.MALFORMED with a best-effort range and reason when markers exist but the shape is invalid.
  • BoundsKind.NONE when no markers are present.

Notes

Subclasses may override this method to provide format-specific detection and location validation but should preserve the discriminated-union semantics of the return value.

Source code in src/topmark/processors/base.py
def get_header_bounds(
    self,
    *,
    lines: Iterable[str],
    newline_style: str,
) -> HeaderBounds:
    """Locate the TopMark header and return its bounds as a ``HeaderBounds`` result.

    This method first performs a **marker preflight** to catch malformed
    shapes (e.g., lone ``:end``, lone ``:start``, multiple or reversed markers).
    It then applies format-aware detection and proximity validation to return
    a valid span when present.

    Args:
        lines: Logical file lines (``keepends=True``). The iterable
            may be list-backed or lazy (e.g., a generator).
        newline_style: Dominant newline style (``LF``, ``CR``, ``CRLF``);
            used only to translate a character offset into a line index when
            the processor returns ``NO_LINE_ANCHOR``.

    Returns:
        A discriminated result:
            - ``BoundsKind.SPAN`` with ``start`` (inclusive) and ``end`` (exclusive)
              when a valid header can be used.
            - ``BoundsKind.MALFORMED`` with a best-effort range and ``reason`` when
              markers exist but the shape is invalid.
            - ``BoundsKind.NONE`` when no markers are present.

    Notes:
        Subclasses may override this method to provide format-specific detection
        and location validation but should preserve the discriminated-union
        semantics of the return value.
    """
    # Materialize once for look-ahead and validation.
    buf: list[str] = list(lines)

    if not buf:
        return HeaderBounds(kind=BoundsKind.NONE)

    # --- Preflight: marker-shape scan (format-agnostic) --------------------
    start_idxs: list[int] = []
    end_idxs: list[int] = []
    i: int
    ln: str
    for i, ln in enumerate(buf):
        # Accept either exact directive lines or markers inside a single-line comment
        # wrapper; the more exact check (line_has_directive) happens later.
        if TOPMARK_START_MARKER in ln:
            start_idxs.append(i)
        if TOPMARK_END_MARKER in ln:
            end_idxs.append(i)

    if end_idxs and not start_idxs:
        i = end_idxs[0]
        reason: str = "end marker without preceding start"
        logger.debug(reason)
        return HeaderBounds(
            kind=BoundsKind.MALFORMED,
            start=None,
            end=i + 1,
            reason=reason,
        )

    if start_idxs and not end_idxs:
        s: int = start_idxs[0]
        reason = "start marker without matching end"
        logger.debug(reason)
        return HeaderBounds(
            kind=BoundsKind.MALFORMED,
            start=s,
            end=None,
            reason=reason,
        )

    if start_idxs and end_idxs:
        # We only want to find the first header occurrence
        s0: int
        e0: int
        s0, e0 = start_idxs[0], end_idxs[0]
        if e0 < s0:
            s_min: int = min(s0, e0)
            e_max: int = max(s0, e0) + 1
            reason = "end marker before start marker"
            logger.debug(reason)
            return HeaderBounds(
                kind=BoundsKind.MALFORMED,
                start=s_min,
                end=e_max,
                reason=reason,
            )
        elif e0 == s0:
            # Exclusive end: cover the single offending line for consistent diagnostics.
            reason = "start and end marker on the same line"
            logger.debug(reason)
            return HeaderBounds(
                kind=BoundsKind.MALFORMED,
                start=s0,
                end=e0 + 1,
                reason=reason,
            )

    # --- Policy-aware detection near computed anchor -----------------------
    anchor_idx: int = self.compute_insertion_anchor(buf)
    if anchor_idx == NO_LINE_ANCHOR:
        text: str = "".join(buf)
        char_off: int | None = self.get_header_insertion_char_offset(text)
        if char_off is not None:
            # Translate char offset to a line index using newline_style
            # (best-effort; the default processor doesn't rely on it further).
            nl: str = newline_style or "\n"
            anchor_idx = text[:char_off].count(nl)
        else:
            anchor_idx = 0

    if self.block_prefix and self.block_suffix:
        candidates: list[tuple[int, int]] = self._collect_bounds_block_comments(buf)
        # should return outer-inclusive spans
    else:
        candidates = self._collect_bounds_line_comments(buf)

    for s, e_inclusive in candidates:
        # Convert inclusive end → exclusive end for view/bounds consumers.
        e_exclusive: int = e_inclusive + 1
        if self.validate_header_location(
            buf,
            header_start_idx=s,
            header_end_idx=e_inclusive,
            anchor_idx=anchor_idx,
        ):
            return HeaderBounds(kind=BoundsKind.SPAN, start=s, end=e_exclusive)

    # No acceptable header near the anchor; treat as absent.
    return HeaderBounds(kind=BoundsKind.NONE)
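
The marker preflight in the listing above can be read as a small classification table. The sketch below reproduces only that preflight step under simplifying assumptions: it returns plain tuples instead of HeaderBounds, the marker strings are placeholders, and the "candidate" outcome stands in for the point where the real method continues with anchor-based detection.

```python
from typing import List, Optional, Tuple

START = "topmark:header:start"  # placeholder marker values (assumptions)
END = "topmark:header:end"

Result = Tuple[str, Optional[int], Optional[int], Optional[str]]


def preflight(lines: List[str]) -> Result:
    """Classify the marker shape as (kind, start, exclusive_end, reason)."""
    starts = [i for i, ln in enumerate(lines) if START in ln]
    ends = [i for i, ln in enumerate(lines) if END in ln]
    if not starts and not ends:
        return ("none", None, None, None)
    if ends and not starts:
        return ("malformed", None, ends[0] + 1, "end marker without preceding start")
    if starts and not ends:
        return ("malformed", starts[0], None, "start marker without matching end")
    s0, e0 = starts[0], ends[0]  # only the first occurrence of each marker matters
    if e0 < s0:
        return ("malformed", e0, s0 + 1, "end marker before start marker")
    if e0 == s0:
        return ("malformed", s0, e0 + 1, "start and end marker on the same line")
    # Shape is plausible; the real method now validates placement near the anchor.
    return ("candidate", s0, e0 + 1, None)


well_formed = preflight(["# topmark:header:start\n", "#   file: a.py\n", "# topmark:header:end\n"])
reversed_markers = preflight(["# topmark:header:end\n", "# topmark:header:start\n"])
```

Note that every malformed range uses an exclusive end, matching the convention of the SPAN result.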

strip_header_block

strip_header_block(
    *,
    lines,
    span=None,
    newline_style="\n",
    ends_with_newline=None,
)

Remove the TopMark header block and return the updated file image.

This method supports two detection modes:

  1. Policy-aware detection (preferred): If span is not provided, the processor calls get_header_bounds(lines=lines, newline_style=newline_style) to locate a valid header near the computed insertion anchor. This respects file-type placement rules (shebang handling, XML prolog, Markdown fences, etc.). If the markers are malformed, the header is not stripped and the lines are returned unchanged.

  2. Permissive fallback (best-effort): If policy-aware detection finds no header (BoundsKind.NONE), the method performs a lightweight scan for the first START..END marker pair anywhere in the file. The scan accepts either exact directive matches (prefix/suffix aware) or marker substrings appearing inside single-line comment wrappers (e.g., <!-- TOPMARK_START_MARKER --> for XML/HTML/Markdown). This covers older files or content transformed by formatters.

After the header is removed, the method trims exactly one blank spacer line left at the removal site, including at the very top of the file (start == 0), to avoid introducing an extra gap.
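
The removal and single-spacer cleanup can be illustrated with a small standalone sketch. It is not the TopMark implementation: the function name strip_span is hypothetical, it treats any whitespace-only line as a spacer (the real code delegates to is_pure_spacer() with the file type's header policy), and it folds the top-of-file and general cases into one check.

```python
from typing import List, Tuple


def strip_span(lines: List[str], span: Tuple[int, int]) -> List[str]:
    """Remove an inclusive (start, end) span and at most one blank line after it."""
    start, end = span
    if start < 0 or end < start or end >= len(lines):
        return lines  # invalid span: leave the content untouched
    new_lines = lines[:start] + lines[end + 1:]
    # Trim exactly one spacer at the removal site (assumed: whitespace-only line).
    if start < len(new_lines) and new_lines[start].strip() == "":
        del new_lines[start]
    return new_lines


source = [
    "# topmark:header:start\n",  # placeholder marker text (assumption)
    "#   file: demo.py\n",
    "# topmark:header:end\n",
    "\n",
    "\n",
    "print('hi')\n",
]
stripped = strip_span(source, (0, 2))
```

Two blank lines followed the header in this input; only one of them is removed, so one blank line remains before the code.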

Parameters:

  • lines (list[str], required): Full file content split into lines (each typically ending with a newline).
  • span (tuple[int, int] | None, default None): Optional inclusive (start, end) line index tuple, normally provided by the scanner via ctx.existing_header_range. When set, no scanning is performed.
  • newline_style (str, default '\n'): Newline style (LF, CR, CRLF).
  • ends_with_newline (bool | None, default None): If known, whether the original file ended with a newline. If None, this information is not available.

Returns:

StripHeaderResult: Structured strip result containing the updated file lines, the inclusive removed span when a header was removed, and the diagnostic describing the outcome.

Raises:

RuntimeError: If policy-aware bounds detection reports a SPAN but omits start/end indices.

Source code in src/topmark/processors/base.py
def strip_header_block(
    self,
    *,
    lines: list[str],
    span: tuple[int, int] | None = None,
    newline_style: str = "\n",
    ends_with_newline: bool | None = None,
) -> StripHeaderResult:
    """Remove the TopMark header block and return the updated file image.

    This method supports two detection modes:

    1. **Policy-aware detection** (preferred):
       If ``span`` is not provided, the processor calls
       ``get_header_bounds(lines=lines, newline_style=newline_style)``
       to locate a valid header near the computed insertion anchor. This respects
       file-type placement rules (shebang handling, XML prolog, Markdown fences, etc.).

    2. **Permissive fallback** (best-effort):
       If policy-aware detection finds no header (``BoundsKind.NONE``), the method
       performs a lightweight scan for
       the first ``START``..``END`` marker pair *anywhere* in the file. The scan
       accepts either exact directive matches (prefix/suffix aware) **or** marker
       substrings appearing inside single-line comment wrappers (e.g.,
       ``<!-- TOPMARK_START_MARKER -->`` for XML/HTML/Markdown). This covers older
       files or content transformed by formatters.

    After the header is removed, the method trims **exactly one** blank spacer
    line left at the removal site, including at the very top of the file
    (``start == 0``), to avoid introducing an extra gap.

    Args:
        lines: Full file content split into lines (each typically ending with a newline).
        span: Optional inclusive ``(start, end)`` line index tuple,
            normally provided by the scanner via ``ctx.existing_header_range``.
            When set, no scanning is performed.
        newline_style: Newline style (``LF``, ``CR``, ``CRLF``).
        ends_with_newline: If known, whether the original file ended with a newline.
            If ``None``, this information is not available.

    Returns:
        Structured strip result containing the updated file lines, the
        inclusive removed span when a header was removed, and the diagnostic
        describing the outcome.

    Raises:
        RuntimeError: If policy-aware bounds detection reports a SPAN but omits
            start/end indices.
    """
    # 1) Resolve bounds: prefer explicit span, else policy-aware detection.
    if span is None:
        # First try the standard, policy-aware bounds detection.
        start: int | None
        end: int | None
        bounds: HeaderBounds = self.get_header_bounds(lines=lines, newline_style=newline_style)
        if bounds.kind is BoundsKind.SPAN:
            # convert exclusive end to inclusive span expected by this method
            if bounds.start is None or bounds.end is None:
                raise RuntimeError("Start and end bounds must be defined.")
            span = (bounds.start, bounds.end - 1)

        elif bounds.kind is BoundsKind.MALFORMED:
            # Do not strip malformed headers; return unchanged lines.
            return StripHeaderResult(
                lines=lines,
                removed_span=None,
                diagnostic=StripDiagnostic(
                    kind=StripDiagKind.MALFORMED_REFUSED,
                    reason=bounds.reason,
                ),
            )

        else:  # BoundsKind.NONE
            span = None
            # Fall through to the permissive scan below.

        if span is None:
            # Permissive scan: accept directive substrings inside single-line
            # comment wrappers (e.g., XML/HTML `<!-- ... -->`).
            # Useful when stripping headers that were inserted by older versions
            # or were moved by formatting tools.
            n: int = len(lines)
            i = 0
            while i < n:
                # Accept either exact directive match (prefix/suffix-aware)
                # or the directive appearing inside a single-line comment wrapper.
                start_match: bool = self.line_has_directive(lines[i], TOPMARK_START_MARKER) or (
                    TOPMARK_START_MARKER in lines[i]
                )
                if start_match:
                    j: int = i + 1
                    while j < n:
                        end_match: bool = self.line_has_directive(
                            lines[j], TOPMARK_END_MARKER
                        ) or (TOPMARK_END_MARKER in lines[j])
                        if end_match:
                            span = (i, j)
                            break
                        j += 1
                    if span is not None:
                        break
                i += 1

    # 2) No header? Return original content unchanged.
    if span is None:
        return StripHeaderResult(
            lines=lines,
            removed_span=None,
            diagnostic=StripDiagnostic(
                kind=StripDiagKind.NOT_FOUND,
            ),
        )

    start, end = span
    # Defensive validation of bounds
    if start < 0 or end < start or end >= len(lines):
        # Defensive: invalid span -> no-op
        return StripHeaderResult(
            lines=lines,
            removed_span=None,
            diagnostic=StripDiagnostic(
                kind=StripDiagKind.NOT_FOUND,
            ),
        )

    # Remove the block (inclusive header span)
    new_lines: list[str] = lines[:start] + lines[end + 1 :]
    policy: FileTypeHeaderPolicy | None = getattr(
        getattr(self, "file_type", None), "header_policy", None
    )

    # Policy-aware cleanup: trim exactly one spacer left by removal.
    # Use in-place deletion (del) to preserve list identity.
    if start == 0:
        # Top-of-file: remove a single leading spacer if present
        if new_lines and is_pure_spacer(new_lines[0], policy):
            del new_lines[0]
    else:
        # General case: remove a single spacer at the removal site
        if 0 <= start < len(new_lines) and is_pure_spacer(new_lines[start], policy):
            del new_lines[start]

    return StripHeaderResult(
        lines=new_lines,
        removed_span=(start, end),
        diagnostic=StripDiagnostic(
            kind=StripDiagKind.REMOVED,
            removed_span=(start, end),
        ),
    )
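
The permissive fallback scan from the listing above can also be tried in isolation. This is a simplified sketch: the function name permissive_scan and the marker strings are assumptions, and it uses only the substring test, which already covers every line that the stricter directive check would accept.

```python
from typing import List, Optional, Tuple

START = "topmark:header:start"  # placeholder marker values (assumptions)
END = "topmark:header:end"


def permissive_scan(lines: List[str]) -> Optional[Tuple[int, int]]:
    """Return the first inclusive (start, end) marker pair anywhere in the file."""
    for i, line in enumerate(lines):
        if START in line:
            # Look for the first end marker strictly after the start line.
            for j in range(i + 1, len(lines)):
                if END in lines[j]:
                    return (i, j)
    return None


xml_lines = [
    '<?xml version="1.0"?>\n',
    "<root>\n",
    "<!-- topmark:header:start -->\n",
    "<!-- file: a.xml -->\n",
    "<!-- topmark:header:end -->\n",
    "</root>\n",
]
found = permissive_scan(xml_lines)
missing_end = permissive_scan(xml_lines[:4])
```

The header in this input sits below the root element, so anchor-based detection would likely reject it, while the permissive scan still finds the pair.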

prepare_header_for_insertion

prepare_header_for_insertion(
    *,
    original_lines,
    insert_index,
    rendered_header_lines,
    newline_style,
)

Adjust whitespace around the header for line-based insertion.

Default implementation returns rendered_header_lines unchanged. Subclasses and mixins can override to add/remove leading or trailing blank lines depending on surrounding context.

Parameters:

  • original_lines (list[str], required): The original file lines.
  • insert_index (int, required): Line index at which the header will be inserted.
  • rendered_header_lines (list[str], required): The header lines to insert.
  • newline_style (str, required): Newline style (LF, CR, CRLF).

Returns:

list[str]: Possibly modified header lines to insert at insert_index.

Source code in src/topmark/processors/base.py
def prepare_header_for_insertion(
    self,
    *,
    original_lines: list[str],
    insert_index: int,
    rendered_header_lines: list[str],
    newline_style: str,
) -> list[str]:
    """Adjust whitespace around the header for line-based insertion.

    Default implementation returns ``rendered_header_lines`` unchanged. Subclasses
    and mixins can override to add/remove leading or trailing blank lines
    depending on surrounding context.

    Args:
        original_lines: The original file lines.
        insert_index: Line index at which the header will be inserted.
        rendered_header_lines: The header lines to insert.
        newline_style: Newline style (``LF``, ``CR``, ``CRLF``).

    Returns:
        Possibly modified header lines to insert at ``insert_index``.
    """
    return rendered_header_lines
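
An override of this hook might pad the header with blank lines depending on its neighbours. The sketch below shows one possible policy as a free function; the name pad_header_lines and the padding rule itself are assumptions for illustration, and it assumes newline_style holds the literal newline sequence (such as "\n") rather than a label.

```python
from typing import List


def pad_header_lines(
    original_lines: List[str],
    insert_index: int,
    rendered_header_lines: List[str],
    newline_style: str = "\n",
) -> List[str]:
    """Add a blank line before/after the header when the neighbour is not blank."""
    out = list(rendered_header_lines)  # never mutate the caller's list
    has_prev = 0 < insert_index <= len(original_lines)
    if has_prev and original_lines[insert_index - 1].strip():
        out.insert(0, newline_style)
    has_next = 0 <= insert_index < len(original_lines)
    if has_next and original_lines[insert_index].strip():
        out.append(newline_style)
    return out


header = ["# topmark:header:start\n", "# topmark:header:end\n"]  # placeholder text
script = ["#!/usr/bin/env python\n", "import os\n"]
padded = pad_header_lines(script, 1, header)
unchanged = pad_header_lines(["\n", "import os\n"], 0, header)
```

Inserting after the shebang yields a blank line on both sides of the header, while inserting at index 0 before an existing blank line leaves the header as rendered.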

get_header_insertion_char_offset

get_header_insertion_char_offset(original_text)

Return a character offset for text-based insertion, or None.

This hook enables processors to compute insertion points that are not line-based (e.g., XML prolog-aware placement when declaration/DOCTYPE and content appear on the same line). Returning None signals that the pipeline should fall back to the standard line-based insertion path.

Parameters:

  • original_text (str, required): Full file content as a single string.

Returns:

int | None: 0-based character offset at which to insert, or None to use the line-based insertion strategy.

Source code in src/topmark/processors/base.py
def get_header_insertion_char_offset(self, original_text: str) -> int | None:
    """Return a character offset for text-based insertion, or ``None``.

    This hook enables processors to compute non-line-based insertion points
    (e.g., XML prolog-aware placement when declaration/DOCTYPE and content appear
    on the same line). Returning ``None`` signals that the pipeline should fall
    back to the standard line-based insertion path.

    Args:
        original_text: Full file content as a single string.

    Returns:
        0-based character offset at which to insert, or ``None`` to use the line-based
        insertion strategy.
    """
    return None
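
For a positional format, an override could return the offset just past an XML declaration. The following is a deliberately simplified sketch, not TopMark's XML processor: the function name and the regular expression are assumptions, and it ignores DOCTYPE declarations, byte-order marks, and leading whitespace.

```python
import re
from typing import Optional

# Simplified pattern for an XML declaration at the very start of the text (assumption).
_XML_DECL = re.compile(r"<\?xml[^>]*\?>")


def xml_insertion_char_offset(original_text: str) -> Optional[int]:
    """Return the offset right after the XML declaration, or None if there is none."""
    match = _XML_DECL.match(original_text)
    if match is None:
        return None  # caller falls back to line-based insertion
    return match.end()


text = '<?xml version="1.0"?><root/>'
offset = xml_insertion_char_offset(text)
no_decl = xml_insertion_char_offset("<root/>")

# Same translation as in get_header_bounds(): count newlines before the offset.
line_index = text[:offset].count("\n") if offset is not None else 0
```

Here the declaration and the root element share one line, which is the case where a line anchor alone cannot express the insertion point.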

prepare_header_for_insertion_text

prepare_header_for_insertion_text(
    *,
    original_text,
    insert_offset,
    rendered_header_text,
    newline_style,
)

Adjust the rendered header text before text-based insertion.

Subclasses may override this to add or trim surrounding newlines so the header block sits on its own lines when performing text-based insertion.

Parameters:

  • original_text (str, required): Full file content as a single string.
  • insert_offset (int, required): 0-based character offset where the header will be inserted.
  • rendered_header_text (str, required): The header block as a single string.
  • newline_style (str, required): Newline style (LF, CR, CRLF).

Returns:

str: The (possibly modified) header text to splice into original_text at insert_offset.

Source code in src/topmark/processors/base.py
def prepare_header_for_insertion_text(
    self,
    *,
    original_text: str,
    insert_offset: int,
    rendered_header_text: str,
    newline_style: str,
) -> str:
    """Adjust the rendered header *text* before text-based insertion.

    Subclasses may override this to add or trim surrounding newlines so the header
    block sits on its own lines when performing text-based insertion.

    Args:
        original_text: Full file content as a single string.
        insert_offset: 0-based character offset where the header will be inserted.
        rendered_header_text: The header block as a single string.
        newline_style: Newline style (``LF``, ``CR``, ``CRLF``).

    Returns:
        The (possibly modified) header text to splice into ``original_text`` at
        ``insert_offset``.
    """
    return rendered_header_text
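
One way an override could make the header sit on its own lines is to add a newline on either side only where one is missing. This is a hedged sketch under assumptions: pad_header_text is a hypothetical free function, the padding rule is illustrative, and newline_style is assumed to be the literal newline sequence.

```python
def pad_header_text(
    original_text: str,
    insert_offset: int,
    rendered_header_text: str,
    newline_style: str = "\n",
) -> str:
    """Surround the header with newlines where the neighbouring text lacks them."""
    before = original_text[:insert_offset]
    after = original_text[insert_offset:]
    out = rendered_header_text
    # Start on a fresh line unless the preceding text already ends one.
    if before and not before.endswith(("\n", "\r")):
        out = newline_style + out
    # Push the following content onto its own line if the header does not end one.
    if after and not out.endswith(("\n", "\r")):
        out = out + newline_style
    return out


text = '<?xml version="1.0"?><root/>'
offset = len('<?xml version="1.0"?>')
header_text = pad_header_text(text, offset, "<!-- header -->")
spliced = text[:offset] + header_text + text[offset:]
```

The spliced result places the declaration, the header, and the root element on three separate lines.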