topmark.resolution.files

Resolve the concrete filesystem inputs that TopMark should process.

This module expands configured or positional paths, applies include/exclude pattern filters, optionally constrains the candidate set by configured file type identifiers, and returns a deterministic list of files to process.

Conceptually, this module answers a different question from topmark.resolution.filetypes:

Positional globs are expanded relative to the current working directory (CWD). Globs declared in configuration files are expanded relative to the directory of each declaring config file. Paths loaded from files_from sources are resolved against their declaring source base directory.
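The base-directory rules above can be illustrated with a small sketch. The helper names `expand_glob` and `resolve_listed_path` are hypothetical and are not part of TopMark; the sketch only shows, under that assumption, how the same pattern yields different matches depending on the base it is expanded from.

```python
import tempfile
from pathlib import Path


def expand_glob(pattern: str, base: Path) -> list[Path]:
    """Expand a glob pattern relative to an explicit base directory (hypothetical helper)."""
    return sorted(p for p in base.glob(pattern) if p.is_file())


def resolve_listed_path(entry: str, base: Path) -> Path:
    """Resolve a path read from a list file against that file's base (hypothetical helper)."""
    candidate = Path(entry)
    return candidate if candidate.is_absolute() else base / candidate


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "project" / "src").mkdir(parents=True)
    (root / "project" / "src" / "a.py").write_text("", encoding="utf-8")
    (root / "other.py").write_text("", encoding="utf-8")

    config_dir = root / "project"  # stands in for the declaring config file's directory
    # The same pattern matches different files depending on the base directory.
    from_config_names = [p.name for p in expand_glob("src/*.py", config_dir)]
    from_root = expand_glob("src/*.py", root)
    # A relative entry from a list file is joined onto the source's base directory.
    listed_exists = resolve_listed_path("src/a.py", config_dir).exists()
```

Passing the base explicitly, rather than relying on the process CWD, keeps the result independent of where the command is invoked.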

The module also provides discovery-level probe helpers for topmark probe. Those helpers explain why explicitly requested paths did not reach file-type probing, without enumerating every recursively discovered file excluded during normal traversal.

FileListResolution dataclass

FileListResolution(
    *, selected, missing_literals, unmatched_patterns
)

Resolved file-list result plus discovery diagnostics.

Attributes:

Name | Type | Description
selected | tuple[Path, ...] | Concrete files selected for processing.
missing_literals | tuple[Path, ...] | Explicit literal input paths that do not exist.
unmatched_patterns | tuple[str, ...] | Glob patterns that matched no files.

load_patterns_from_file

load_patterns_from_file(source)

Load non-empty, non-comment patterns from a text file.

The pattern semantics mirror .gitignore: each pattern is later evaluated relative to the pattern file's own base directory (source.base).

Parameters:

Name | Type | Description | Default
source | PatternSource | Reference to the pattern file and its base. | required

Returns:

Type | Description
list[str] | A list of patterns as strings.

Source code in src/topmark/resolution/files.py
def load_patterns_from_file(
    source: PatternSource,
) -> list[str]:
    """Load non-empty, non-comment patterns from a text file.

    The pattern semantics mirror .gitignore: each pattern is later evaluated
    relative to the pattern file's own base directory (``source.base``).

    Args:
        source: Reference to the pattern file and its base.

    Returns:
        A list of patterns as strings.
    """
    try:
        text: str = source.path.read_text(encoding="utf-8")
    except OSError as e:
        # FileNotFoundError is a subclass of OSError, so one handler covers both.
        logger.error("Cannot read patterns from '%s': %s", source.path, e)
        return []
    patterns: list[str] = []
    for line in text.splitlines():
        s: str = line.strip()
        if not s or s.startswith("#"):
            continue
        logger.debug("Source %s - appending %s", source, s)
        patterns.append(s)
    logger.debug("Loaded %d pattern(s) from %s (base=%s)", len(patterns), source.path, source.base)
    return patterns
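
To make the line-filtering rule concrete, here is a minimal sketch that applies the same strip, skip-empty, skip-comment logic to an in-memory string. The function name `parse_pattern_lines` is hypothetical; it isolates the parsing step so it can be tried without a PatternSource or a file on disk.

```python
def parse_pattern_lines(text: str) -> list[str]:
    """Keep stripped lines that are neither empty nor '#' comments (hypothetical helper)."""
    patterns: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        patterns.append(stripped)
    return patterns


sample = "# build output\n\ndist/\n  *.pyc  \n#tmp\n!keep.pyc\n"
parsed = parse_pattern_lines(sample)
```

Note that surrounding whitespace is stripped from each line and that negated patterns such as `!keep.pyc` pass through unchanged; their interpretation is left to the matcher.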

probe_explicit_file_selection

probe_explicit_file_selection(config, *, selected_files)

Explain explicit inputs that were not selected for file-type probing.

This helper is intentionally narrow: it only reports paths explicitly named through positional inputs or files_from. It does not enumerate every file excluded during recursive directory traversal, because that could produce unexpectedly large diagnostic output for topmark probe.

Parameters:

Name | Type | Description | Default
config | FrozenConfig | Effective layered configuration used for file discovery. | required
selected_files | Sequence[Path] | Files selected by resolve_file_list(). | required

Returns:

Type | Description
tuple[FileSelectionProbeResult, ...] | Discovery-level probe results for explicit inputs that did not reach file-type probing.

Source code in src/topmark/resolution/files.py
def probe_explicit_file_selection(
    config: FrozenConfig,
    *,
    selected_files: Sequence[Path],
) -> tuple[FileSelectionProbeResult, ...]:
    """Explain explicit inputs that were not selected for file-type probing.

    This helper is intentionally narrow: it only reports paths explicitly named
    through positional inputs or `files_from`. It does not enumerate every file
    excluded during recursive directory traversal, because that could produce
    unexpectedly large diagnostic output for `topmark probe`.

    Args:
        config: Effective layered configuration used for file discovery.
        selected_files: Files selected by `resolve_file_list()`.

    Returns:
        Discovery-level probe results for explicit inputs that did not reach
        file-type probing.
    """
    selected_real: set[Path] = _selected_real_paths(selected_files)
    results: list[FileSelectionProbeResult] = []

    for explicit in _explicit_input_paths(config):
        try:
            real: Path = explicit.resolve()
        except OSError:
            real = explicit

        if real in selected_real:
            continue

        if not explicit.exists():
            results.append(
                FileSelectionProbeResult(
                    path=explicit,
                    status=FileSelectionStatus.NOT_FOUND,
                    reason=FileSelectionReason.NOT_FOUND,
                )
            )
            continue

        if not explicit.is_file():
            results.append(
                FileSelectionProbeResult(
                    path=explicit,
                    status=FileSelectionStatus.FILTERED,
                    reason=FileSelectionReason.NOT_A_FILE,
                )
            )
            continue
        # The path exists and is a file, but it disappeared from the selected
        # file list. Classify the broad filter category while keeping exact
        # pattern/source attribution out of the stable probe contract for now.
        results.append(
            FileSelectionProbeResult(
                path=explicit,
                status=FileSelectionStatus.FILTERED,
                reason=_classify_explicit_filter_reason(explicit, config),
            )
        )

    return tuple(results)
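
The classification order used above (already selected, then missing, then filtered) can be tried in isolation with the simplified sketch below. `Status` and `classify_unselected` are hypothetical stand-ins for this example; unlike the real helper, the sketch does not distinguish NOT_A_FILE from other filter reasons.

```python
import tempfile
from enum import Enum
from pathlib import Path
from typing import Optional


class Status(Enum):
    """Hypothetical stand-in for FileSelectionStatus."""

    NOT_FOUND = "not_found"
    FILTERED = "filtered"


def classify_unselected(explicit: Path, selected_real: set[Path]) -> Optional[Status]:
    """Return None when the path was selected, else a coarse status."""
    try:
        real = explicit.resolve()
    except OSError:
        real = explicit
    if real in selected_real:
        return None
    if not explicit.exists():
        return Status.NOT_FOUND
    # Directories and files dropped by filters both count as filtered here.
    return Status.FILTERED


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    kept, dropped, missing = root / "kept.py", root / "dropped.py", root / "missing.py"
    kept.write_text("", encoding="utf-8")
    dropped.write_text("", encoding="utf-8")
    selected_real = {kept.resolve()}
    statuses = {
        p.name: classify_unselected(p, selected_real) for p in (kept, dropped, missing)
    }
```

Comparing resolved paths means an explicit input still counts as selected when it was given in a different spelling, for example through a symlink or a relative path.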

resolve_file_list_with_diagnostics

resolve_file_list_with_diagnostics(config)

Return concrete input files plus discovery diagnostics.

The resolver implements these semantics:
  1. Candidate set: Expand positional paths (files, directories recursively, and globs). If no positional paths are provided, extend with any literal paths read from --files-from before filtering. If the candidate set is still empty and include globs are configured, expand those include globs from both the current working directory (CLI perspective) and each discovered/explicit config file's directory (config perspective) to seed candidates.
  2. File-only: Only files (not directories) are kept for filtering.
  3. Include intersection: If any include patterns (from include pattern groups or include_from files) are given, filter the candidate set to only those files matching any include pattern (intersection filter).
  4. Exclude subtraction: If any exclude patterns (from exclude pattern groups or exclude_from files) are given, remove any files matching the exclusion patterns from the set.
  5. File type filter: If include_file_types or exclude_file_types are specified, further restrict to files matching those types.
  6. Returns a sorted list of Path objects for deterministic output.
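
The filtering steps above (include intersection, exclude subtraction, sorted output) can be sketched as follows. This is a deliberately simplified illustration: it uses the standard library's `fnmatchcase` instead of the gitignore-style matching the module relies on, and `filter_candidates` is a hypothetical name. In `fnmatch` patterns, `*` also matches `/`, which differs from gitignore semantics.

```python
from fnmatch import fnmatchcase


def filter_candidates(
    candidates: list[str],
    include: list[str],
    exclude: list[str],
) -> list[str]:
    """Apply include intersection, exclude subtraction, then sort (hypothetical helper)."""
    kept = set(candidates)  # a set also removes duplicate candidates
    if include:
        # Keep only candidates matching at least one include pattern.
        kept = {c for c in kept if any(fnmatchcase(c, pat) for pat in include)}
    if exclude:
        # Drop candidates matching any exclude pattern.
        kept = {c for c in kept if not any(fnmatchcase(c, pat) for pat in exclude)}
    return sorted(kept)


selected = filter_candidates(
    ["src/c.py", "src/a.py", "src/gen/b.py", "docs/readme.md", "src/a.py"],
    include=["src/*"],
    exclude=["src/gen/*"],
)
```

Because excludes are applied after includes, a file matched by both is dropped. Sorting at the end makes the result independent of set iteration order.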

Parameters:

Name | Type | Description | Default
config | FrozenConfig | Effective layered configuration. | required

Returns:

Type | Description
FileListResolution | A FileListResolution containing selected files and discovery diagnostics.

Source code in src/topmark/resolution/files.py
def resolve_file_list_with_diagnostics(
    config: FrozenConfig,
) -> FileListResolution:
    """Return concrete input files plus discovery diagnostics.

    The resolver implements these semantics:
      1. **Candidate set**: Expand positional paths (files, directories recursively, and globs).
         If no positional paths are provided, extend with any literal paths read from
         ``--files-from`` **before filtering**. If the candidate set is still empty and
         include globs are configured, expand those include globs from **both** the current
         working directory (CLI perspective) and each discovered/explicit config file's
         directory (config perspective) to seed candidates.
      2. **File-only**: Only files (not directories) are kept for filtering.
      3. **Include intersection**: If any include patterns
         (from include pattern groups or `include_from` files)
         are given, filter the candidate set to only those files matching *any* include pattern
         (intersection filter).
      4. **Exclude subtraction**: If any exclude patterns
         (from exclude pattern groups or `exclude_from` files)
         are given, remove any files matching the exclusion patterns from the set.
      5. **File type filter**: If `include_file_types` or `exclude_file_types` are specified,
         further restrict to files matching those types.
      6. Returns a **sorted** list of Path objects for deterministic output.

    Args:
        config: Effective layered configuration.

    Returns:
        A [FileListResolution][topmark.resolution.files.FileListResolution]
        containing selected files and discovery diagnostics.
    """
    logger.debug("resolve_file_list_with_diagnostics(): config: %s", config)

    # Normalize config collections to stable, predictable types.
    positional_paths: tuple[str, ...] = config.files

    include_pattern_groups: tuple[PatternGroup, ...] = config.include_pattern_groups
    exclude_pattern_groups: tuple[PatternGroup, ...] = config.exclude_pattern_groups

    include_sources: tuple[PatternSource, ...] = config.include_from
    exclude_sources: tuple[PatternSource, ...] = config.exclude_from

    # Keep the original config-source identifiers so we can derive config-file
    # base directories for include/exclude pattern expansion.
    config_files: tuple[Path | SyntheticConfigSource, ...] = config.config_files

    include_file_types: frozenset[str] = frozenset(config.include_file_types)
    exclude_file_types: frozenset[str] = frozenset(config.exclude_file_types)

    files_from_sources: tuple[PatternSource, ...] = config.files_from

    workspace_root: Path | None = config.relative_to
    # Defensive fallback only; in normal operation resolve_config_from_click() sets this.
    if workspace_root is None:
        workspace_root = Path.cwd()

    cwd: Path = Path.cwd()

    if logger.isEnabledFor(TRACE_LEVEL):
        logger.trace(
            """\
    positional_paths: %s
    include_pattern_groups: %r
    include_sources: %s
    exclude_pattern_groups: %r
    exclude_sources: %s
    config_files: %s
    include_file_types: %s
    exclude_file_types: %s
    files_from_sources: %s
    workspace_root: %s
    config: %s
""",
            positional_paths,
            include_pattern_groups,
            include_sources,
            exclude_pattern_groups,
            exclude_sources,
            config_files,
            include_file_types,
            exclude_file_types,
            files_from_sources,
            workspace_root,
            config,
        )

    # -------- Precompute exclude specs to allow directory-level pruning --------

    # Each exclude spec is paired with its interpretation base directory.
    exclude_specs_all: list[tuple[GitIgnorePathSpec, Path]] = _compile_matchers(
        exclude_pattern_groups,
        exclude_sources,
    )

    def _is_excluded_dir(path: Path) -> bool:
        """Return True if a directory should be pruned during traversal.

        This uses the same PathSpec semantics as the later exclude subtraction step,
        but is applied to directory paths so we can avoid descending into subtrees
        that would be entirely excluded anyway.
        """
        real: Path = path.resolve()
        return _matches_any(exclude_specs_all, real)

    def _expand_path(p: Path) -> list[Path]:
        """Expand a base path into a list of files and directories.

        Handles globs, directories (recursively), and files.
        Globs are expanded relative to the current working directory.

        Args:
            p: Base path to expand.

        Returns:
            List of expanded paths (files and directories).
        """

        def _walk_dir(root: Path) -> list[Path]:
            """Walk a directory tree, pruning excluded subdirectories early."""
            out: list[Path] = []

            # If the root directory itself is excluded, skip its entire subtree.
            if _is_excluded_dir(root):
                logger.debug("Skipping excluded root dir during expansion: %s", root)
                return out

            for dirpath, dirnames, filenames in os.walk(root):
                dirpath_path: Path = Path(dirpath)

                # Prune excluded subdirectories in-place so os.walk never enters them.
                kept_dirnames: list[str] = []
                for name in dirnames:
                    subdir: Path = dirpath_path / name
                    if _is_excluded_dir(subdir):
                        logger.debug("Pruning excluded subdir during expansion: %s", subdir)
                        continue
                    kept_dirnames.append(name)
                dirnames[:] = kept_dirnames

                for fname in filenames:
                    out.append(dirpath_path / fname)

            return out

        # Glob patterns are expanded relative to CWD (Black-style args).
        # Keep this branch exactly as before to preserve Path.rglob semantics.
        if "*" in str(p):
            logger.debug("Processing glob pattern: %s", p)
            return list(Path().rglob(str(p)))
        # If the path is a directory, recursively include all files and subdirectories,
        # but prune directories that are already excluded by config.
        if p.is_dir():
            logger.debug("Processing dir: %s", p)
            path_list: list[Path] = _walk_dir(p)
            logger.debug(
                "Processing dir: %s - returning %d item(s)",
                p,
                len(path_list),
            )
            return path_list
        # If the path is a file, return it as a single-item list
        if p.is_file():
            return [p]
        # Otherwise, return empty list (path does not exist or is unsupported)
        return []

    # Step 1: Build candidate set from positional paths only; do not treat
    # config files as inputs. We'll optionally seed from include globs later.
    if len(positional_paths) > 0:
        # Use positional paths if provided
        # NOTE: This branch is reachable depending on CLI/config inputs;
        # some static analyzers may flag it falsely.
        input_paths: list[Path] = [Path(p) for p in positional_paths]
    else:
        input_paths = []

    logger.debug("Initial input paths: %s", input_paths)
    # Merge paths from files-from into the candidate inputs (resolve relatives vs. source.base)
    for psrc in files_from_sources or []:
        input_paths.extend(_read_input_paths_from_source(psrc))
    logger.debug("Input paths before expansion: %s", input_paths)

    # If there are no explicit inputs (positional or files-from) but include globs
    # were provided, expand each pattern group relative to its declaring base
    # directory to seed candidates.
    if not input_paths and include_pattern_groups:
        # NOTE: This branch is reachable depending on CLI/config inputs;
        # some static analyzers may flag it falsely.
        expanded_from_includes: set[Path] = set()

        # Expand config-declared pattern groups relative to their declaring base.
        for grp in include_pattern_groups:
            base_dir: Path = grp.base.resolve()
            for pat in grp.patterns:
                for hit in base_dir.glob(pat):
                    if hit.is_file():
                        expanded_from_includes.add(hit.resolve())
        if expanded_from_includes:
            input_paths.extend(sorted(expanded_from_includes))
            logger.debug(
                "Expanded include patterns from CWD + %d group(s): %d match(es)",
                len(include_pattern_groups),
                len(expanded_from_includes),
            )

    # Step 2: Expand base paths into a set of files (and directories initially)
    candidate_set: set[Path] = set()
    unmatched_patterns: list[str] = []
    missing_literals: list[Path] = []

    for raw in input_paths:
        p = Path(raw)
        # Expand
        expanded: list[Path] = _expand_path(p)
        candidate_set.update(expanded)

        # Report problems *after* expansion
        if "*" in str(p):
            if not expanded:
                unmatched_patterns.append(str(p))  # glob that matched nothing
        else:
            if not p.exists():
                missing_literals.append(p)  # literal path that doesn't exist

    # Emit warnings once (keeps logs tidy)
    if unmatched_patterns:
        for up in unmatched_patterns:
            logger.warning("No matches for glob pattern: %s", up)
    if missing_literals:
        for ml in missing_literals:
            logger.warning("No such file or directory: %s", ml)

    # Only keep files (drop directories) before filtering
    candidate_set = {p for p in candidate_set if p.is_file()}

    # Step 3: Apply include intersection filter (if any include patterns)
    # Merge include_patterns + include_from patterns
    any_includes: bool = bool(include_pattern_groups) or bool(include_sources)
    if any_includes:
        kept: set[Path] = set()
        include_specs_all: list[tuple[GitIgnorePathSpec, Path]] = _compile_matchers(
            include_pattern_groups,
            include_sources,
        )
        if include_specs_all:
            for p in candidate_set:
                if _matches_any(include_specs_all, p):
                    kept.add(p)
        candidate_set = kept

    # Step 4: Apply exclude subtraction filter (if any exclude patterns/sources)
    any_excludes: bool = bool(exclude_pattern_groups) or bool(exclude_sources)
    if any_excludes:
        kept: set[Path] = set()
        # 4.1 exclude_patterns: evaluate against CWD and each config file directory
        if exclude_specs_all:
            # NOTE: This branch is reachable depending on CLI/config inputs;
            # some static analyzers may flag it falsely.
            for p in candidate_set:
                if not _matches_any(exclude_specs_all, p):
                    kept.add(p)
        else:
            kept = set(candidate_set)
        candidate_set = kept

    filtered_paths: set[Path] = candidate_set

    # Step 5: Filter files by configured file type identifiers if specified.
    #
    # Config values may contain either unqualified names ("markdown") or
    # qualified identifiers ("topmark:markdown"). Resolve them through the
    # namespace-aware file type registry before applying path-based matching.

    # 5.1: whitelisted file types
    # Invalid entries are handled and reported as Config diagnostic in MutableConfig.sanitize()
    if include_file_types:
        selected_include_types: list[FileType] = _resolve_configured_file_types(include_file_types)

        # Whitelisting:
        filtered_paths = {
            file_path
            for file_path in filtered_paths
            if _matches_any_file_type(file_path, selected_include_types)
        }

    # 5.2: blacklisted file types
    if exclude_file_types:
        selected_exclude_types: list[FileType] = _resolve_configured_file_types(exclude_file_types)

        # Blacklisting:
        filtered_paths = {
            file_path
            for file_path in filtered_paths
            if not _matches_any_file_type(file_path, selected_exclude_types)
        }

    # Step 6 (Finalize): dedupe by real path, prefer CWD-relative presentation
    out_by_real: dict[Path, Path] = {}
    for p in filtered_paths:
        real: Path = p.resolve()
        try:
            rel_to_cwd: Path = real.relative_to(cwd.resolve())
            rep: Path = rel_to_cwd
        except (OSError, ValueError):
            rep = real  # keep absolute if not within CWD
        if real not in out_by_real:
            out_by_real[real] = rep

    result: list[Path] = sorted(out_by_real.values(), key=lambda q: q.as_posix())
    logger.trace("Files to process: %d -- %s", len(result), result)
    return FileListResolution(
        selected=tuple(result),
        missing_literals=tuple(missing_literals),
        unmatched_patterns=tuple(unmatched_patterns),
    )
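
The directory-pruning technique used in `_walk_dir` relies on mutating the `dirnames` list in place during a top-down `os.walk`. The sketch below shows that technique on its own. It matches excluded directories by bare name, which is a simplification of the path-spec matching in the real code, and `walk_files` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def walk_files(root: Path, excluded_dir_names: set[str]) -> list[Path]:
    """Collect files under root, never descending into excluded directories."""
    found: list[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Slice assignment mutates the list that os.walk itself holds, so pruned
        # directories are never entered (this requires topdown=True, the default).
        dirnames[:] = [d for d in dirnames if d not in excluded_dir_names]
        found.extend(Path(dirpath) / name for name in filenames)
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "node_modules" / "pkg").mkdir(parents=True)
    (root / "src" / "a.py").write_text("", encoding="utf-8")
    (root / "node_modules" / "pkg" / "b.js").write_text("", encoding="utf-8")
    walked = [p.relative_to(root).as_posix() for p in walk_files(root, {"node_modules"})]
```

Rebinding the name (`dirnames = [...]`) would have no effect on traversal; only in-place modification prunes the walk.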