Troubleshooting Guide

This guide covers common issues and their solutions when using CryoPARES.

Table of Contents

  • Installation Issues
  • Configuration Issues
  • File System Issues
  • Memory Issues
  • Training Issues
  • Inference Issues
  • Data Issues
  • Performance Issues
  • CUDA/GPU Issues
  • Output Quality Issues
  • Getting More Help
  • See Also

Installation Issues

pip install fails with dependency conflicts

Symptoms:

ERROR: pip's dependency resolver does not currently take into account all the packages

Solution:

  1. Create a fresh conda environment:

conda create -n cryopares_fresh python=3.12
conda activate cryopares_fresh
  2. Install in order:

pip install git+https://github.com/rsanchezgarc/cryoPARES.git

ImportError: No module named 'cryoPARES'

Symptoms:

ImportError: No module named 'cryoPARES'

Solutions:

  1. Verify installation:

pip list | grep cryoPARES
  2. If using a development install, check you’re in the right directory:

cd /path/to/cryoPARES
pip install -e .
  3. Check Python environment:

which python
# Should point to your conda environment
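
If the package shows up in pip but the import still fails, you can also confirm which installation Python actually resolves:

# Print where Python imports cryoPARES from
import cryoPARES
print(cryoPARES.__file__)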

CUDA version mismatch

Symptoms:

RuntimeError: CUDA error: no kernel image is available for execution on the device

Solution:

Reinstall PyTorch with correct CUDA version:

# Check CUDA version
nvidia-smi

# Install matching PyTorch (example for CUDA 11.8)
pip install --force-reinstall numpy scikit-image scipy torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e .
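
After reinstalling, you can confirm that the PyTorch build matches your driver:

import torch
print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # Should be True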

Configuration Issues

Type mismatch errors when using --config

Symptoms:

TypeError: argument must be int, not float
ValueError: could not convert string to float

Cause: Parameter types must match exactly when using --config overrides.

Solution:

Always match the parameter type:

  1. For float parameters - include a decimal point:

    # Correct:
    --config train.learning_rate=1e-3
    --config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2.0
    
    # Wrong:
    --config train.learning_rate=1      # Missing decimal - will fail!
    --config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2  # Missing decimal - will fail!
    
  2. For int parameters - do NOT include a decimal point:

    # Correct:
    --config models.image2sphere.lmax=8
    --config train.n_epochs=100
    
    # Wrong:
    --config models.image2sphere.lmax=8.0    # Has decimal - will fail!
    --config train.n_epochs=100.0            # Has decimal - will fail!
    
  3. Check parameter types:

    python -m cryopares_train --show-config | grep parameter_name
    # Look for (int) or (float) annotation
    

File System Issues

“Too many open files” error

Symptoms:

OSError: [Errno 24] Too many open files

Root cause: CryoPARES opens one file handle for each .mrcs file referenced in the .star file.

Solutions:

  1. Immediate fix: Increase file descriptor limit:

ulimit -n 65536
  2. Permanent fix: Add to .bashrc or .bash_profile:

echo "ulimit -n 65536" >> ~/.bashrc
source ~/.bashrc

If you cannot raise the limit far enough, join your particle stacks into a smaller number of .mrcs files (a snippet to count how many stacks your .star file references follows this list).

  3. System-wide fix (requires root):

# Edit /etc/security/limits.conf
sudo nano /etc/security/limits.conf

# Add these lines:
* soft nofile 65536
* hard nofile 65536
  4. Verify limit:

ulimit -n
# Should show 65536 or higher
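
To count how many distinct .mrcs stacks your .star file references (and hence roughly how many file descriptors are needed), here is a small sketch assuming the starfile package:

import starfile

star = starfile.read('/path/to/particles.star')
particles = star['particles'] if isinstance(star, dict) else star
# _rlnImageName entries look like "000001@Extract/job01/stack.mrcs"
n_stacks = particles['rlnImageName'].str.split('@').str[1].nunique()
print(f'{n_stacks} distinct .mrcs files referenced')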

Permission denied when writing outputs

Symptoms:

PermissionError: [Errno 13] Permission denied: '/path/to/output'

Solutions:

  1. Check directory permissions:

ls -ld /path/to/output
  2. Create directory with correct permissions:

mkdir -p /path/to/output
chmod 755 /path/to/output
  3. Use a different output directory:

--train_save_dir ~/cryopares_outputs/

Particle files not found

Symptoms:

FileNotFoundError: Particle file not found: /path/to/particles/...

Solutions:

  1. Check --particles_dir argument:

# If .star file has relative paths like:
# MotionCorr/job01/particles_001.mrcs
# and the MotionCorr directory is at /path/to/relion/project/

# Use:
--particles_dir /path/to/relion/project/
  2. Verify .star file paths:

head -20 /path/to/particles.star
# Check the _rlnImageName column
  3. Make paths absolute:

import starfile

star = starfile.read('/path/to/particles.star')
# RELION 3.1+ .star files load as a dict of blocks (optics + particles)
df = star['particles'] if isinstance(star, dict) else star
# _rlnImageName is "index@stack.mrcs"; prepend the path to the file part only
df['rlnImageName'] = df['rlnImageName'].apply(
    lambda x: '{}@/absolute/path/{}'.format(*x.split('@'))
)
starfile.write(star, 'particles_absolute.star')

Memory Issues

Out of memory (OOM) during training

Symptoms:

RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB

Solutions:

  1. Reduce batch size:

--batch_size 16

You can compensate for the smaller batch size, up to a point, by increasing accumulate_grad_batches so that the effective batch size (batch_size x accumulate_grad_batches) stays constant, e.g. 16 x 32 = 512:

--config train.accumulate_grad_batches=32
# Maintains effective batch size while reducing memory
  2. Reduce image size:

--config datamanager.particlesDataset.image_size_px_for_nnet=96

and/or

--config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2.0

Note that a larger sampling rate implies smaller images (e.g., resampling from 1.0 Å/px to 2.0 Å/px roughly halves the image side length). Do not try to reduce the image size at inference; it will not work, because the checkpoint is built for the image size used during training.

  3. Reduce model complexity:

Several parameters contribute heavily to the model size. Decrease them to obtain a smaller network:

--config models.image2sphere.lmax=10  # Default is 12
--config models.image2sphere.so3components.i2sprojector.sphere_fdim=256  # Default is 512
--config models.image2sphere.so3components.s2conv.f_out=32  # Default is 64
--config models.image2sphere.imageencoder.out_channels=256

Out of memory during inference

Solutions:

  1. Reduce inference batch size:

--batch_size 16

RAM exhausted when loading data

Symptoms:

MemoryError: Unable to allocate array

Solutions:

  1. Disable in-memory caching:

--config datamanager.particlesDataset.store_data_in_memory=False
  2. Reduce number of workers:

--num_dataworkers 2

Training Issues

Loss becomes NaN

Symptoms:

loss: nan, geo_degs: nan

Causes and solutions:

  1. Learning rate too high:

--config train.learning_rate=1e-4
  2. Numerical instability:

--config train.weight_decay=1e-6  # Reduce regularization
  3. Bad data: Check for corrupted particles or extreme values in .star file
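
As a quick sanity check for bad metadata, you can scan the .star file for non-finite or implausible values (a sketch assuming the starfile package; the defocus bounds are only illustrative):

import numpy as np
import starfile

star = starfile.read('/path/to/particles.star')
df = star['particles'] if isinstance(star, dict) else star
# Flag non-finite or implausible defocus values (in Å); adjust bounds to your dataset
bad = df[~np.isfinite(df['rlnDefocusU']) | (df['rlnDefocusU'] <= 0) | (df['rlnDefocusU'] > 60000)]
print(f'{len(bad)} particles with suspicious defocus values')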

Training doesn’t improve

Symptoms:

  • Loss plateaus immediately

  • val_geo_degs > 30° after many epochs

Diagnostic steps:

  1. Verify data is pre-aligned:

# Check that .star file contains orientation columns:
# _rlnAngleRot, _rlnAngleTilt, _rlnAnglePsi
grep "rlnAngle" /path/to/particles.star
  2. Test overfitting capability:

python -m cryopares_train \
    --symmetry C1 \
    --particles_star_fname data.star \
    --train_save_dir /tmp/overfit_test \
    --n_epochs 100 \
    --overfit_batches 10
  3. Check symmetry: Wrong symmetry can make training impossible.

  4. Increase learning rate:

--config train.learning_rate=5e-3
  5. Increase model capacity:

--config models.image2sphere.lmax=14

Validation loss higher than training loss

Normal: Small gap (< 20%) is expected and healthy.

Concerning: Large gap (> 50%) indicates overfitting.

Solutions for overfitting:

  1. Increase regularization:

--config train.weight_decay=1e-4
  2. Reduce model complexity:

--config models.image2sphere.lmax=10
  3. More training data: Use more particles in .star file

  4. Check for data leakage: Ensure train/val split is correct

Training crashes with “Killed”

Symptoms: Process killed with no error message.

Cause: System OOM (out of RAM, not GPU memory).

Solutions:

  1. Reduce workers:

--num_dataworkers 2
  2. Reduce batch size:

--batch_size 16
  3. Monitor memory:

watch -n 1 free -h

Checkpoints not saving

Symptoms: No .ckpt files in train_save_dir/version_0/half1/checkpoints/

Solutions:

  1. Check disk space:

df -h /path/to/train_save_dir
  2. Check write permissions:

ls -ld /path/to/train_save_dir
  3. Verify training completes at least one epoch: Check logs for validation step completion


Inference Issues

Predicted poses are random

Symptoms:

  • Angular error > 90°

  • Reconstruction looks like noise

Causes and solutions:

  1. Wrong checkpoint directory:

# Should point to version_0, not to half1 or half2
--checkpoint_dir /path/to/training/version_0
# NOT: /path/to/training/version_0/half1
  2. Model not trained: Check training logs to verify training completed

  3. Different molecule: Model trained on a different protein than the inference data

  4. Wrong symmetry: Verify symmetry matches training

Reconstruction is blurry

Causes and solutions:

  1. Insufficient local refinement:

--config projmatching.grid_distance_degs=10.0

Note: increasing this parameter makes running time and memory consumption grow substantially.

  2. Too strict confidence filtering:

--config inference.directional_zscore_thr=1.5
  3. Not enough particles passing filter: Check output .star file size

  4. Wrong reference map: Provide a better reference:

--reference_map /path/to/good_reference.mrc

“No particles passed confidence threshold”

Symptoms:

Warning: No particles passed directional_zscore_thr=2.0

Solutions:

  1. Lower threshold:

--config inference.directional_zscore_thr=1.0
  2. Disable filtering:

--config inference.directional_zscore_thr=None
  3. Check if model and data match: Verify you’re using the correct trained model
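
To choose a sensible threshold, it can also help to inspect the distribution of per-particle scores in the predicted .star file. The column name below is a placeholder; check the header of your output file for the actual label:

import starfile

star = starfile.read('/path/to/output/particles.star')
df = star['particles'] if isinstance(star, dict) else star
score_col = 'rlnDirectionalZscore'  # placeholder name; use the column actually present in your file
print(df[score_col].describe())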

Inference slower than expected

Solutions:

  1. Increase batch size:

--batch_size 64

Note that this also increases GPU memory consumption.

  2. Reduce top_k predictions: If you set inference.top_k_poses_nnet>1, decrease it:

--config inference.top_k_poses_nnet=1
  3. Skip reconstruction if not needed:

--config inference.skip_reconstruction=True
  4. Use GPU:

--config inference.use_cuda=True
  5. Compile model:

--compile_model

Data Issues

CTF parameters missing

Symptoms:

KeyError: '_rlnDefocusU'

Solution:

CryoPARES requires CTF parameters in .star file. Ensure your .star file was generated after CTF estimation (e.g., CTFFIND or Gctf in RELION).

Required CTF columns:

  • _rlnDefocusU

  • _rlnDefocusV

  • _rlnDefocusAngle

  • _rlnVoltage

  • _rlnSphericalAberration

  • _rlnAmplitudeContrast
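
A quick way to check that all required CTF labels are present somewhere in the file (optics or particles block), sketched with the starfile package (note that starfile strips the leading underscore from column names):

import starfile

blocks = starfile.read('/path/to/particles.star')
if not isinstance(blocks, dict):
    blocks = {'particles': blocks}
present = set().union(*(set(b.columns) for b in blocks.values() if hasattr(b, 'columns')))
required = ['rlnDefocusU', 'rlnDefocusV', 'rlnDefocusAngle',
            'rlnVoltage', 'rlnSphericalAberration', 'rlnAmplitudeContrast']
print('Missing CTF labels:', [c for c in required if c not in present] or 'none')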

Orientation columns missing

Symptoms:

KeyError: '_rlnAngleRot'

Solution:

For training, you need pre-aligned particles with:

  • _rlnAngleRot

  • _rlnAngleTilt

  • _rlnAnglePsi

  • _rlnOriginXAngst

  • _rlnOriginYAngst

For inference, orientations are optional (will be predicted).

Particles are poorly centered

Symptoms:

  • High angular errors

  • Poor reconstruction quality

Solution:

Re-center particles in RELION:

  1. Extract particles with larger box size

  2. Run 2D or 3D classification

  3. Re-extract centered particles

Different sampling rate in data vs. reference

Symptoms:

RuntimeError: Size mismatch between particles and reference

Solution:

CryoPARES handles rescaling automatically, but verify:

  1. Sampling rates are specified in .star file: Check for _rlnImagePixelSize or _rlnDetectorPixelSize

  2. Reference map has correct pixel size:

# Check with mrcfile
python -c "import mrcfile; print(mrcfile.open('ref.mrc').voxel_size)"
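
You can also compare the pixel size recorded in the .star optics block against the reference map in one go (a sketch assuming starfile and mrcfile; column names follow RELION 3.1+ conventions):

import mrcfile
import starfile

star = starfile.read('/path/to/particles.star')
optics = star.get('optics') if isinstance(star, dict) else None
if optics is not None and 'rlnImagePixelSize' in optics.columns:
    print('Particle pixel size (Å):', float(optics['rlnImagePixelSize'].iloc[0]))
with mrcfile.open('ref.mrc', permissive=True) as f:
    print('Reference voxel size (Å):', f.voxel_size.x)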

Performance Issues

Training is very slow

Diagnostic:

Check if GPU is being used:

import torch
print(torch.cuda.is_available())  # Should be True
print(torch.cuda.device_count())   # Should be > 0

Solutions:

  1. Enable model compilation:

--compile_model
  2. Increase batch size:

--batch_size 64
  3. Reduce image size:

--config datamanager.particlesDataset.image_size_px_for_nnet=96
  4. Use multiple GPUs: CryoPARES automatically uses all available GPUs

  5. Reduce model complexity:

--config models.image2sphere.lmax=12
  6. Check that data loading is not the bottleneck:

--num_dataworkers 8

You might want to use top/htop to monitor CPU usage as well as I/O. If you see your data workers mostly in the S (sleeping) state, that is a sign of an I/O bottleneck.

  7. Use faster precision:

--config float32_matmul_precision="medium"

Inference is very slow

Solutions:

  1. Increase batch size:

--batch_size 64
  2. Reduce angular search range:

--config projmatching.grid_distance_degs=4.0
  3. Skip local refinement and reconstruction (if acceptable):

--config inference.skip_localrefinement=True inference.skip_reconstruction=True
  4. Use coarser search:

--config projmatching.grid_step_degs=3.0

Data loading is bottleneck

Symptoms:

  • GPU utilization < 50%

  • High CPU usage from data workers

Solutions:

  1. Increase workers:

--num_dataworkers 8
  2. Use faster storage: Move data to local SSD instead of network drive

  3. Reduce preprocessing: Ensure images don’t require heavy rescaling
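
If you want numbers rather than eyeballing nvidia-smi, here is a minimal GPU-utilization probe (assuming the nvidia-ml-py / pynvml package is installed):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f'GPU util: {util.gpu}%  memory util: {util.memory}%')
    time.sleep(1)
pynvml.nvmlShutdown()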


CUDA/GPU Issues

“CUDA out of memory” but GPU seems empty

Cause: Memory fragmentation.

Solutions:

  1. Restart training:

# Clear GPU memory
nvidia-smi
# Kill any zombie processes
  2. Reduce batch size: Even if GPU shows free memory, fragmentation can prevent allocation

Multiple GPUs not being used

Diagnostic:

nvidia-smi
# Only GPU 0 shows activity

Solution:

CryoPARES uses PyTorch Lightning’s automatic multi-GPU training. To force specific GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m cryopares_train ...

GPU slower than expected

Solutions:

  1. Check GPU is not throttling:

nvidia-smi --query-gpu=temperature.gpu,clocks.current.graphics --format=csv
  2. Ensure you are not using integrated graphics:

nvidia-smi --query-gpu=name --format=csv
  3. Update NVIDIA drivers:

nvidia-smi
# Check driver version, update if old

Output Quality Issues

Reconstruction has artifacts

Causes and solutions:

  1. Overfitting to noise:

    • Use directional_zscore_thr to filter low-confidence particles

    • Use matching half-sets (default behavior)

  2. CTF correction issues:

    • Verify CTF parameters are correct

  3. Mask too tight:

--config datamanager.particlesDataset.mask_radius_angs=150  # Increase
  4. Insufficient particles: Check how many particles passed filtering

FSC is lower than expected

Solutions:

  1. More thorough local refinement:

--config projmatching.grid_distance_degs=10.0 \
        projmatching.grid_step_degs=1.0
  2. Filter particles more aggressively:

--config inference.directional_zscore_thr=2.5

Reconstructed map has wrong hand

Solution:

The model learns the hand from the training data. If the training data had the wrong hand, flip the reference map:

import mrcfile

with mrcfile.open('volume.mrc') as f:
    vol, voxel_size = f.data.copy(), f.voxel_size
vol_flipped = vol[:, :, ::-1].copy()  # mirroring any single axis inverts the hand
with mrcfile.new('volume_flipped.mrc', overwrite=True) as out:
    out.set_data(vol_flipped)
    out.voxel_size = voxel_size

Getting More Help

Enable Debug Mode

For more verbose output:

python -m cryopares_train \
    --show_debug_stats \
    ... other args ...

Check Logs

Training logs are saved to:

train_save_dir/version_0/half*/

Report Issues

If you encounter a bug:

  1. Check existing issues: https://github.com/rsanchezgarc/cryoPARES/issues

  2. Provide:

    • Full command used

    • Error message and stack trace

    • CryoPARES version: pip show cryopares

    • PyTorch version: python -c "import torch; print(torch.__version__)"

    • CUDA version: nvidia-smi

  3. Minimal reproducible example: Create smallest possible example that shows the bug


See Also