Usage Instructions
==================

This section provides practical examples demonstrating how to run the Gene Discovery and Enzyme Engineering features of the GDEE Platform.

Overview
--------

The GDEE platform provides a unified interface through the :class:`~gdee.engineer.ProteinEngineering` class for executing comprehensive protein engineering workflows. The platform supports:

- Multiple variant generation strategies (from sequence-based gene discovery to mutation-based enzyme engineering)
- 3D structure modeling with MODELLER
- Model quality assessment using VoroMQA and Normalized DOPE
- Molecular docking with AutoDock Vina/Vinardo
- Distance measurements for pose filtering
- Both single-machine and distributed MPI execution
- SQLite database storage for all results

The following sections illustrate how to configure and run typical workflows for gene discovery, enzyme engineering, and rescoring existing results.


Configuration Parameters Reference
----------------------------------

This section provides a comprehensive reference for all configuration parameters available in the GDEE platform through the :class:`~gdee.engineer.ProteinEngineering` class.

Platform Configuration
~~~~~~~~~~~~~~~~~~~~~~

Control the execution environment and computational resources:

.. code-block:: python

    # Platform execution settings
    eng.platform["name"] = "simple"     # Execution mode: "simple" or "mpi"
    eng.platform["local_cpu"] = 1       # Number of CPU cores per MPI process

**Available Options:**

- **name**:
  - ``"simple"``: Single-threaded execution on local machine
  - ``"mpi"``: Distributed execution using MPI for parallel processing
- **local_cpu**: Integer specifying CPU cores for local parallelism within each MPI process

File Management Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure output file organization and compression:

.. code-block:: python

    # File archiving and output management
    eng.io["output"] = "files"                   # Base directory for output files
    eng.io["output_format"] = ".{:06d}"          # Archive naming format
    eng.io["output_freq"] = 1000                 # Jobs per archive file

**Parameters:**

- **output**: String specifying the base directory name for storing output files
- **output_format**: Format string for archive file naming (supports Python string formatting)
- **output_freq**: Integer defining how many completed jobs to include per archive file

External Program Paths
~~~~~~~~~~~~~~~~~~~~~~

Specify paths to required external software:

.. code-block:: python

    # External program configuration
    eng.programs["mgltools"] = "/path/to/MGLTools-1.5.6"           # MGLTools installation
    eng.programs["vina"] = "/path/to/vina"                         # AutoDock Vina executable
    eng.programs["vinardo"] = "/path/to/smina"                     # Smina with Vinardo scoring (optional)
    eng.programs["voromqa"] = "/path/to/voronota-voromqa"          # VoroMQA executable

**Required Programs:**

- **mgltools**: Path to MGLTools installation directory (required for PDBQT conversion)
- **vina**: Path to AutoDock Vina executable (required for molecular docking)
- **vinardo**: Path to Smina executable with Vinardo scoring (optional alternative scorer)
- **voromqa**: Path to VoroMQA executable (optional for advanced model quality assessment)

Variant Generation Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Control how protein variants are generated:

.. code-block:: python

    # Variant generation settings
    eng.variant["name"] = "mutation"                   # Generation strategy
    eng.variant["selection"] = "A:100 A:150 A:200"     # Residues to mutate
    eng.variant["fixed"] = "A:50 A:75"                 # Fixed residues during optimization
    eng.variant["excluded_all"] = "CGP"                # Globally excluded amino acids
    eng.variant["excluded"] = {"A:100": "FWYM"}        # Position-specific exclusions
    eng.variant["conservative"] = True                 # Use conservative mutations
    eng.variant["max_iterations"] = 1000               # Maximum variants to generate
    eng.variant["combinations"] = 2                    # Maximum simultaneous mutations
    eng.variant["matrix"] = "blosum62"                 # Substitution matrix
    eng.variant["msa"] = "sequences.fasta"             # Multiple sequence alignment file

**Strategy Options:**

- **name**:
  - ``"mutation"``: Matrix-based mutagenesis using substitution matrices
  - ``"exhaustive"``: Systematic combinatorial mutagenesis
  - ``"msa"``: Sequence variants from FASTA files

**Selection Parameters:**

- **selection**: Space-separated list of residues to mutate (format: "ChainID:ResidueNumber")
- **fixed**: Space-separated list of residues to keep fixed during MODELLER optimization (format: "ChainID:ResidueNumber")
- **excluded_all**: String of amino acid single-letter codes to exclude globally
- **excluded**: Dictionary mapping residue positions to excluded amino acids

**Mutation Control:**

- **conservative**: Boolean controlling mutation bias (True = conservative, False = non-conservative)
- **max_iterations**: Maximum number of variants to generate
- **combinations**: Maximum number of simultaneous mutations per variant
- **matrix**: Substitution matrix name ("blosum62" or custom matrix specification)
- **msa**: Path to FASTA file containing sequences for FASTA-based variant generation

Structure Modeling Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure 3D structure modeling with MODELLER:

.. code-block:: python

    # MODELLER configuration
    eng.model["name"] = "modeller"             # Modeling method
    eng.model["num_models"] = 5                # Number of models per variant
    eng.model["optimize_radius"] = 8           # Optimization radius in Angstroms
    eng.model["optimize_level"] = 1            # Optimization thoroughness

**Parameters:**

- **name**: Modeling method (currently only "modeller" is supported)
- **num_models**: Number of 3D models to generate per sequence variant
- **optimize_radius**: Distance in Angstroms around mutations to optimize (0 = just mutated residues)
- **optimize_level**:
  - ``0``: Fast optimization (very_fast schedule, fast MD)
  - ``1``: Normal optimization (normal schedule, slow MD)
  - ``2``: Thorough optimization (slow schedule, very_slow MD)

Model Quality Assessment Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure quality filtering for generated models:

.. code-block:: python

    # Model quality thresholds
    eng.model_quality["norm_dope"] = -1.0      # Normalized DOPE score threshold
    eng.model_quality["voromqa"] = 0.4         # VoroMQA score threshold

**Quality Metrics:**

- **norm_dope**: Normalized DOPE score threshold (models with scores higher than this value are rejected)
- **voromqa**: VoroMQA score threshold (models with scores below this value are rejected, requires VoroMQA installation)

Molecular Docking Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure protein-ligand docking parameters:

.. code-block:: python

    # Docking configuration
    eng.evaluator["name"] = "vina"                         # Docking engine
    eng.evaluator["exhaustiveness"] = 32                   # Search thoroughness
    eng.evaluator["box_center"] = [10.0, 15.0, 20.0]       # Search box center (x, y, z)
    eng.evaluator["box_size"] = [20.0, 20.0, 20.0]         # Search box dimensions

**Docking Parameters:**

- **name**:
  - ``"vina"``: Standard AutoDock Vina scoring
  - ``"vinardo"``: Vinardo scoring function (using Smina)
- **exhaustiveness**: Search thoroughness (higher values = more thorough but slower)
- **box_center**: List of three floats defining the center coordinates of the docking search box
- **box_size**: List of three floats defining the dimensions of the search box in Angstroms

Ligand and Measurement Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure ligands and distance measurements:

.. code-block:: python

    # Add ligands and define measurements
    ligand = eng.add_ligand("substrate", "ligand.pdbqt")

    # Distance measurements between protein and ligand atoms
    ligand.add_measurement("catalytic_distance", "distance",
                          "chainID A and resid 87 and name N", "name O10")

**Ligand Methods:**

- **add_ligand(name, filename)**: Add a ligand for docking
  - ``name``: String identifier for the ligand
  - ``filename``: Path to PDBQT file containing ligand structure

**Measurement Methods:**

- **add_measurement(name, metric, protein_selection, ligand_selection)**: Add distance measurement
  - ``name``: Unique identifier for the measurement
  - ``metric``: Currently only "distance" is supported
  - ``protein_selection``: MDAnalysis selection string for protein atoms
  - ``ligand_selection``: MDAnalysis selection string for ligand atoms

Selection Syntax Reference
~~~~~~~~~~~~~~~~~~~~~~~~~~

GDEE uses MDAnalysis selection syntax for specifying atoms. Common examples:

.. code-block:: python

    # Atom-based selections
    "chainID A and resid 100 and name CA"     # Atom CA of residue 100 in chain A
    "name O10"                                # Atom O10 of the ligand

For complete syntax documentation, refer to the MDAnalysis selection documentation.


Example Workflows
-----------------

Gene Discovery
~~~~~~~~~~~~~~

Gene discovery workflows focus on exploring sequence variants from FASTA files to identify naturally occurring variants that may catalyze specific biochemical reactions.

.. code-block:: python

    from gdee import ProteinEngineering

    # Initialize the GDEE platform with protein name and database file
    # This creates the main workflow orchestrator and database connection
    eng = ProteinEngineering("target-protein", "db-filename")

    # Set the template PDB structure for homology modeling
    # This structure serves as the basis for generating 3D models of variants
    eng.pdb = "protein.pdb"

    # Configure paths to external programs required for the workflow
    # MGLTools: Used for converting PDB to PDBQT format for docking
    eng.programs["mgltools"] = "/path/to/MGLTools-1.5.6"
    # AutoDock Vina: Molecular docking engine for protein-ligand interactions
    eng.programs["vina"] = "/path/to/autodock_vina/vina"
    # VoroMQA: Model quality assessment using Voronoi tessellation
    eng.programs["voromqa"] = "/path/to/voronota/voronota-voromqa"

    # Configure file archiving and output management
    # Archive successful jobs to compressed files for storage efficiency
    eng.io["output"] = "files"
    # Naming format for archived files (5-digit zero-padded numbers)
    eng.io["output_format"] = ".{:05d}"
    # Archive every 50 completed jobs to balance I/O and storage
    eng.io["output_freq"] = 50

    # Configure distributed execution platform
    # Use MPI for parallel processing across multiple compute nodes
    eng.platform["name"] = "mpi"
    # Number of CPU cores to use on each MPI node for local parallelism
    eng.platform["local_cpu"] = 16

    # Configure 3D structure modeling parameters
    # Number of models to be used in the docking step
    eng.model["num_models"] = 5
    # Optimize residues within 8 Å of mutation sites (local optimization)
    eng.model["optimize_radius"] = 8
    # Use moderate optimization level (0=fast, 1=normal, 2=slow)
    eng.model["optimize_level"] = 1

    # Configure molecular docking parameters
    # Use AutoDock Vina as the docking engine
    eng.evaluator["name"] = "vina"
    # High exhaustiveness for thorough conformational search (default: 8)
    eng.evaluator["exhaustiveness"] = 200
    # Center coordinates of the docking search box (x, y, z in Angstroms)
    eng.evaluator["box_center"] = [32, 20, 39.5]
    # Dimensions of the search box (width, height, depth in Angstroms)
    eng.evaluator["box_size"] = [14, 11, 16]

    # Add a ligand for docking
    # Returns a Ligand object for further configuration
    lig = eng.add_ligand("ligand-name", "ligand-file.pdbqt")

    # Define distance measurements between protein and ligand atoms
    # These measurements will be computed for each docking pose
    # Measure distance from residue 292 CA atom to ligand CL1 atom
    lig.add_measurement("metric-1", "distance", "resid 292 and name CA", "name CL1")
    # Measure distance from residue 160 N atom to ligand O2 atom
    lig.add_measurement("metric-2", "distance", "resid 160 and name N", "name O2")
    # Measure distance from residue 220 NH2 group to ligand C12 atom
    lig.add_measurement("metric-3", "distance", "resid 220 and name NH2", "name C12")

    # Configure variant generation from FASTA file
    print("Running MSA")
    eng.variant["name"] = "msa"
    # FASTA file containing sequences to be evaluated
    eng.variant["msa"] = "sequences_20_identity.fasta"

    # Execute the complete gene discovery workflow
    # This will: generate variants → model structures → assess quality → dock ligands → measure distances
    eng.run()

    print("Done!")


Enzyme Engineering
~~~~~~~~~~~~~~~~~~

Enzyme engineering workflows focus on mutation-based optimization to improve specific biochemical reactions.

.. code-block:: python

    from gdee import ProteinEngineering

    # Initialize platform for enzyme engineering study
    eng = ProteinEngineering("target-protein", "db-filename")
    # Template structure of the enzyme to be engineered
    eng.pdb = "protein.pdb"

    # Configure external program paths (same as gene discovery)
    eng.programs["mgltools"] = "/path/to/MGLTools-1.5.6"
    eng.programs["vina"] = "/path/to/autodock_vina/vina"
    eng.programs["voromqa"] = "/path/to/voronota/voronota-voromqa"

    # Configure archiving for larger datasets typical in enzyme engineering
    eng.io["output"] = "files"
    eng.io["output_format"] = ".{:05d}"
    # Archive less frequently due to larger number of variants
    eng.io["output_freq"] = 2000

    # Use distributed execution for systematic mutagenesis screens
    eng.platform["name"] = "mpi"
    eng.platform["local_cpu"] = 16

    # Define structural constraints for mutagenesis
    # Fixed positions that will not be optimized by MODELLER (critical for structure/function)
    # Format: "ChainID:ResidueNumber" separated by spaces
    eng.variant["fixed"] = "A:237 A:206 A:160 A:87 A:161"
    # Positions selected for mutagenesis (target sites for optimization)
    eng.variant["selection"] = "A:185 A:159 A:88 A:93 A:92 A:208"

    # Define amino acid groups for systematic exclusion rules
    # Special amino acids that might disrupt structure
    SPECIAL = "CGP"  # Cysteine (disulfide), Glycine (flexible), Proline (rigid)
    # Charged amino acids
    POSITIVE = "RHK"  # Arginine, Histidine, Lysine
    NEGATIVE = "DE"   # Aspartate, Glutamate
    # Polar amino acids
    POLAR = "QNST"    # Glutamine, Asparagine, Serine, Threonine
    # Hydrophobic amino acids
    NON_POLAR = "AVILMFYW"  # Ala, Val, Ile, Leu, Met, Phe, Trp, Tyr

    # Global exclusion rule: don't use these amino acids at any position
    eng.variant["excluded_all"] = NEGATIVE + POSITIVE + SPECIAL

    # Position-specific exclusion rules for fine-tuned mutagenesis
    # Each position has customized restrictions based on structural role
    eng.variant["excluded"] = {
        "A:159": POLAR,      # Exclude polar residues at position 159
        "A:88": "FWY",       # Exclude bulky aromatics at position 88
        "A:93": "FWYM",      # Exclude aromatics and Met at position 93
        "A:92": NON_POLAR,   # Exclude hydrophobic residues at position 92
        "A:208": "FWYM",     # Exclude bulky residues at position 208
    }

    # Number of models to be used in the docking step
    eng.model["num_models"] = 5
    # Optimize residues within 8 Å of mutation sites (local optimization)
    eng.model["optimize_radius"] = 8
    # Use moderate optimization level (0=fast, 1=normal, 2=slow)
    eng.model["optimize_level"] = 1

    # Configure molecular docking parameters
    # Name of the docking engine to use
    eng.evaluator["name"] = "vina"
    # Exhaustiveness value used in the docking calculations
    eng.evaluator["exhaustiveness"] = 200
    # Center coordinates of the docking search box (x, y, z in Angstroms)
    eng.evaluator["box_center"] = [32, 20, 39.5]
    # Dimensions of the search box (width, height, depth in Angstroms)
    eng.evaluator["box_size"] = [14, 11, 16]

    # Add a ligand for docking
    # Returns a Ligand object for further configuration
    lig = eng.add_ligand("ligand-name", "ligand-file.pdbqt")

    # Define critical distance measurements for pose filtering
    # Distance from catalytic residue N87 to substrate O10 (catalytic interaction)
    lig.add_measurement("metric1", "distance", "chainID A and resid 87 and name N", "name O10")
    # Distance from binding residue N161 to substrate O10 (substrate positioning)
    lig.add_measurement("metric2", "distance", "chainID A and resid 161 and name N", "name O10")
    # Distance from S160 side chain to substrate C3 (cofactor interaction)
    lig.add_measurement("metric3", "distance", "chainID A and resid 160 and name OG", "name C3")
    lig.add_measurement("metric4", "distance", "chainID A and resid 185 and name CA", "name C2")

For enzyme engineering, two variant generation strategies are available:

1. Exhaustive combinatorial mutagenesis: Screen all possible double amino acid substitutions in selected residues

.. code-block:: python

    # Exhaustive combinatorial mutagenesis
    # Generate all possible combinations at selected positions
    print("Running all 2 by 2")
    eng.variant["name"] = "exhaustive"
    # Maximum number of simultaneous mutations per variant (In this example will generate single and double mutamts)
    eng.variant["combinations"] = 2

    # Execute pipeline
    eng.run()

    print("Done!")

2. Sampling-based mutagenesis: Sample the mutation space using conservative substitutions and scoring matrices

.. code-block:: python

    # Sampling approach using substitution matrices (e.g., BLOSUM62) to prioritize conservative changes
    print("Sampling 2 by 2")
    eng.variant["name"] = "mutation"
    # Use conservative amino acid substitutions based on evolutionary data
    eng.variant["conservative"] = True
    # Maximum number of variants to generate (prevents excessive library sizes)
    eng.variant["max_iterations"] = 50000
    # Maximum number of simultaneous mutations per variant (In this example will generate single and double mutamts)
    eng.variant["combinations"] = 2

    # Execute pipeline
    eng.run()

    print("Done!")


Filtering Docking Poses and Ranking Variants
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After workflow completion, use the analysis module to identify the most promising variants based on multiple criteria:

.. code-block:: python

    from gdee.database import Database
    from gdee.analysis.filters import Metric

    # Connect to the results database generated by the platform
    db = Database("db-filename.sqlite3")

    # Create metric objects for each measurement defined in the workflow
    # These correspond to the distance measurements added to ligands above
    metric1 = Metric("metric1", db)
    metric2 = Metric("metric2", db)
    metric3 = Metric("metric3", db)
    metric4 = Metric("metric4", db)

    # Define filtering rules based on biochemical criteria
    # Close catalytic contact for example (< 3.5 Å indicates proper positioning)
    rule1 = metric1 < 3.5
    # Close binding interaction for example (< 3.5 Å indicates strong binding)
    rule2 = metric2 < 3.5
    # Cofactor should be closer to substrate than pocket residue (selectivity)
    rule3 = metric4 > metric3

    # Combine rules using boolean logic
    # Variants must have either good catalytic OR binding contacts AND proper selectivity
    rule4 = (rule1 | rule2) & rule3

    # Create ranking based on binding energy (lower is better)
    # Variants passing the filtering rules are ranked by docking energy
    rank = rule4.rank()
    # Export all passing variants to SQLite database for further analysis
    rank.export_sqlite("rank.sqlite3", "All_Variants")

    # Filter by mutation count for systematic analysis
    # Single mutants are often easier to validate experimentally
    single = rank.by_num_mutations(1)
    single.export_sqlite("rank.sqlite3", "Single")
    # Export top 15 single mutants to CSV for easy inspection
    single.export_csv("rank_single.csv", 15)

    # Double mutants may show synergistic effects
    double = rank.by_num_mutations(2)
    double.export_sqlite("rank.sqlite3", "Double")
    # Export top 15 double mutants for experimental follow-up
    double.export_csv("rank_double.csv", 15)


Rescoring with Machine Learning Scoring Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After completing initial docking and ranking workflows, you can rescore the candidates using a machine learning-based scoring function.

.. note::

   Machine learning scoring functions must be trained prior to use. The rescoring workflow requires pre-trained model files in pickle format:

   - **RF-Score models**: Random Forest-based scoring functions (v1, v2, v3 variants)
   - **PLECnn model**: Deep learning model using Protein-Ligand Extended Connectivity interaction fingerprints

For training these models, refer to the ODDT package documentation.


.. code-block:: python

    from gdee.database import Ranked_Database
    from gdee.engineer import RescoreVariants

    # Load the ranked database from previous workflow filtering
    input_db = Ranked_Database("rank.sqlite3")

    # Specify output database for rescored results
    # This will create a new database with all rescored poses
    output_db = "rank_rescored.sqlite3"

    # Initialize rescoring workflow
    rescoring = RescoreVariants(
        input_db,                                       # Input database with ranked variants
        output_db,                                      # Output database for rescored results
        "AllVariants",                                  # Table name containing variants to rescore
        [
            "RFScore_v1_pdbbind2016.pickle",            # path to RF-Score v1 model
            "RFScore_v2_pdbbind2016.pickle",            # path to RF-Score v2 model
            "RFScore_v3_pdbbind2016.pickle",            # path to RF-Score v3 model
            "PLECnn_p5_l1_pdbbind2016_s65536.pickle"    # path to PLECnn deep learning model
        ],
        "files"                                         # Directory containing structure files
    )

    # Configure distributed or simple execution for rescoring
    rescoring.platform["name"] = "mpi"
    # Number of CPU cores to use on each MPI node for local parallelism
    rescoring.platform["local_cpu"] = 16

    # Execute rescoring workflow
    print("Running rescore")
    rescoring.run()
    print("Done!")


Useful Scripts
--------------

Script to run BLAST search against Swissprot database and save results in XML format:

.. code-block:: python

    from Bio import SeqIO
    from Bio.Blast import NCBIWWW

    # Configuration parameters for BLAST search
    FASTA = "target-protein.fasta"        # Input protein sequence file
    DATABASE = "swissprot"                # database
    E_VALUE = 10                          # E-value threshold
    MAX_HITS = 10000                      # Maximum number of sequences to retrieve
    MIN_IDENTITY = 10                     # Minimum percent identity threshold
    OUTPUT = "blast_results.xml"          # Output file in XML format for parsing

    # Read the query protein sequence from FASTA file
    query = SeqIO.read(FASTA, format="fasta")

    # Perform online BLAST search
    result = NCBIWWW.qblast("blastp", DATABASE, query.seq, expect=E_VALUE,
                            hitlist_size=MAX_HITS, perc_ident=MIN_IDENTITY)

    # Save BLAST results in XML format for subsequent parsing
    # XML format preserves all alignment details and statistics
    with open(OUTPUT, "w") as fd:
        fd.write(result.read())


Script to filter BLAST results based on coverage and identity, and save sequences in FASTA format:

.. code-block:: python

    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.Blast import NCBIXML

    # Configuration parameters for filtering BLAST results
    OUTPUT = "sequences_single_20_identity.fasta"   # Output FASTA file with filtered sequences
    BLAST_XML = "blast_results.xml"                 # Input XML file from BLAST search
    MIN_COVERAGE = 0.8                              # Minimum query coverage
    MIN_IDENTITY = 0.2                              # Minimum sequence identity

    # Parse BLAST results from XML file
    with open("blast_results.xml") as fd:
        blast = NCBIXML.read(fd)

    # Initialize lists to collect filtered sequences and statistics
    expects = []      # E-values for statistical analysis
    sequences = []    # Filtered sequences for output

    # Iterate through all database matches (alignments) from BLAST
    for aln in blast.alignments:
        # Get the best HSP (High-scoring Segment Pair) for each alignment
        # HSPs represent local alignments between query and database sequence
        hsp = aln.hsps[0]

        # Calculate query coverage as fraction of query sequence aligned
        # Higher coverage indicates more complete homology
        coverage = (hsp.query_end - hsp.query_start + 1) / blast.query_length

        # Calculate sequence identity as fraction of identical residues
        # Higher identity indicates closer evolutionary relationship
        identity = hsp.identities / hsp.align_length

        # Skip identical sequences (identity = 1.0) to avoid self-matches
        # These don't provide new information for diversity analysis
        if identity == 1:
            continue

        # Apply coverage and identity filters to select high-quality homologs
        # These thresholds ensure sequences are sufficiently similar and complete
        if coverage >= MIN_COVERAGE and identity >= MIN_IDENTITY:
            # Extract aligned subject sequence and remove gap characters
            # Gaps (-) are alignment artifacts and should be removed
            seq = Seq(hsp.sbjct).replace("-", "")

            # Create sequence record with database accession as identifier
            # Empty description and annotation fields for simplicity
            record = SeqRecord(seq, aln.accession, "", "")
            sequences.append(record)

            # Store E-value for statistical summary
            expects.append(hsp.expect)

    # Print summary statistics about filtering results
    # Helps assess the quality and diversity of the filtered dataset
    print("Filtered {} sequences from {}".format(len(sequences), len(blast.alignments)))
    print("Expect values. Max: {}. Min: {}".format(max(expects), min(expects)))

    # Write filtered sequences to FASTA file
    SeqIO.write(sequences, OUTPUT, "fasta")


The resulting FASTA file is ready for use in the GDEE platform for gene discovery workflows.