Troubleshooting Guide

This guide covers common issues and their solutions when using CryoPARES.

Table of Contents

  • Installation Issues
  • Configuration Issues
  • File System Issues
  • Memory Issues
  • Training Issues
  • Inference Issues
  • Data Issues
  • Performance Issues
  • CUDA/GPU Issues
  • Output Quality Issues
  • Getting More Help
  • See Also

Installation Issues

pip install fails with dependency conflicts

Symptoms:

ERROR: pip's dependency resolver does not currently take into account all the packages

Solution:

  1. Create a fresh conda environment:

conda create -n cryopares_fresh python=3.12
conda activate cryopares_fresh
  2. Install in order:

pip install git+https://github.com/rsanchezgarc/cryoPARES.git

ImportError: No module named 'cryoPARES'

Symptoms:

ImportError: No module named 'cryoPARES'

Solutions:

  1. Verify installation:

pip list | grep cryoPARES
  2. If using a development install, check you’re in the right directory:

cd /path/to/cryoPARES
pip install -e .
  3. Check Python environment:

which python
# Should point to your conda environment
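
If the package shows up in pip but the import still fails, you can also confirm which installation Python actually resolves:

# Print where Python imports cryoPARES from
import cryoPARES
print(cryoPARES.__file__)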

CUDA version mismatch

Symptoms:

RuntimeError: CUDA error: no kernel image is available for execution on the device

Solution:

Reinstall PyTorch with correct CUDA version:

# Check CUDA version
nvidia-smi

# Install matching PyTorch (example for CUDA 11.8)
pip install --force-reinstall numpy scikit-image scipy torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e .
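
After reinstalling, you can confirm that the PyTorch build matches your driver:

import torch
print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # Should be True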

Configuration Issues

Type mismatch errors when using --config

Symptoms:

TypeError: argument must be int, not float
ValueError: could not convert string to float

Cause: Parameter types must match exactly when using --config overrides.

Solution:

Always match the parameter type:

  1. For float parameters - include a decimal point:

    # Correct:
    --config train.learning_rate=1e-3
    --config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2.0
    
    # Wrong:
    --config train.learning_rate=1      # Missing decimal - will fail!
    --config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2  # Missing decimal - will fail!
    
  2. For int parameters - do NOT include a decimal point:

    # Correct:
    --config models.image2sphere.lmax=8
    --config train.n_epochs=100
    
    # Wrong:
    --config models.image2sphere.lmax=8.0    # Has decimal - will fail!
    --config train.n_epochs=100.0            # Has decimal - will fail!
    
  3. Check parameter types:

    python -m cryopares_train --show-config | grep parameter_name
    # Look for (int) or (float) annotation
    

File System Issues

“Too many open files” error

Symptoms:

OSError: [Errno 24] Too many open files

Root cause: CryoPARES opens one file handle for each .mrcs file referenced in the .star file.

Solutions:

  1. Immediate fix: Increase file descriptor limit:

ulimit -n 65536
  2. Permanent fix: Add to .bashrc or .bash_profile:

echo "ulimit -n 65536" >> ~/.bashrc
source ~/.bashrc

If you cannot raise the limit far enough, join your particle stacks into a smaller number of .mrcs files (a snippet to count how many stacks your .star file references follows this list).

  3. System-wide fix (requires root):

# Edit /etc/security/limits.conf
sudo nano /etc/security/limits.conf

# Add these lines:
* soft nofile 65536
* hard nofile 65536
  4. Verify limit:

ulimit -n
# Should show 65536 or higher
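
To count how many distinct .mrcs stacks your .star file references (and hence roughly how many file descriptors are needed), here is a small sketch assuming the starfile package:

import starfile

star = starfile.read('/path/to/particles.star')
particles = star['particles'] if isinstance(star, dict) else star
# _rlnImageName entries look like "000001@Extract/job01/stack.mrcs"
n_stacks = particles['rlnImageName'].str.split('@').str[1].nunique()
print(f'{n_stacks} distinct .mrcs files referenced')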

Permission denied when writing outputs

Symptoms:

PermissionError: [Errno 13] Permission denied: '/path/to/output'

Solutions:

  1. Check directory permissions:

ls -ld /path/to/output
  2. Create directory with correct permissions:

mkdir -p /path/to/output
chmod 755 /path/to/output
  3. Use a different output directory:

--train_save_dir ~/cryopares_outputs/

Particle files not found

Symptoms:

FileNotFoundError: Particle file not found: /path/to/particles/...

Solutions:

  1. Check --particles_dir argument:

# If .star file has relative paths like:
# MotionCorr/job01/particles_001.mrcs
# and the MotionCorr directory is at /path/to/relion/project/

# Use:
--particles_dir /path/to/relion/project/
  2. Verify .star file paths:

head -20 /path/to/particles.star
# Check the _rlnImageName column
  3. Make paths absolute:

import starfile

star = starfile.read('/path/to/particles.star')
# RELION 3.1+ .star files load as a dict of blocks (optics + particles)
df = star['particles'] if isinstance(star, dict) else star
# _rlnImageName is "index@stack.mrcs"; prepend the path to the file part only
df['rlnImageName'] = df['rlnImageName'].apply(
    lambda x: '{}@/absolute/path/{}'.format(*x.split('@'))
)
starfile.write(star, 'particles_absolute.star')

Memory Issues

Out of memory (OOM) during training

Symptoms:

RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB

Solutions:

  1. Reduce batch size:

--batch_size 16

You can compensate for the smaller batch size, up to a point, by increasing accumulate_grad_batches so that the effective batch size (batch_size x accumulate_grad_batches) stays constant, e.g. 16 x 32 = 512:

--config train.accumulate_grad_batches=32
# Maintains effective batch size while reducing memory
  2. Reduce image size:

--config datamanager.particlesDataset.image_size_px_for_nnet=96

and/or

--config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2.0

Note that a larger sampling rate implies smaller images (e.g., resampling from 1.0 Å/px to 2.0 Å/px roughly halves the image side length). Do not try to reduce the image size at inference; it will not work, because the checkpoint is built for the image size used during training.

  3. Reduce model complexity:

Several parameters contribute heavily to the model size. Decrease them to obtain a smaller network:

--config models.image2sphere.lmax=10  # Default is 12
--config models.image2sphere.so3components.i2sprojector.sphere_fdim=256  # Default is 512
--config models.image2sphere.so3components.s2conv.f_out=32  # Default is 64
--config models.image2sphere.imageencoder.out_channels=256

Out of memory during inference

Solutions:

  1. Reduce inference batch size:

--batch_size 16

RAM exhausted when loading data

Symptoms:

MemoryError: Unable to allocate array

Solutions:

  1. Disable in-memory caching:

--config datamanager.particlesDataset.store_data_in_memory=False
  2. Reduce number of workers:

--num_dataworkers 2

Training Issues

Loss becomes NaN

Symptoms:

loss: nan, geo_degs: nan

Causes and solutions:

  1. Learning rate too high:

--config train.learning_rate=1e-4
  2. Numerical instability:

--config train.weight_decay=1e-6  # Reduce regularization
  3. Bad data: Check for corrupted particles or extreme values in .star file
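
As a quick sanity check for bad metadata, you can scan the .star file for non-finite or implausible values (a sketch assuming the starfile package; the defocus bounds are only illustrative):

import numpy as np
import starfile

star = starfile.read('/path/to/particles.star')
df = star['particles'] if isinstance(star, dict) else star
# Flag non-finite or implausible defocus values (in Å); adjust bounds to your dataset
bad = df[~np.isfinite(df['rlnDefocusU']) | (df['rlnDefocusU'] <= 0) | (df['rlnDefocusU'] > 60000)]
print(f'{len(bad)} particles with suspicious defocus values')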

Training doesn’t improve

Symptoms:

  • Loss plateaus immediately

  • val_geo_degs > 30° after many epochs

Diagnostic steps:

  1. Verify data is pre-aligned:

# Check that .star file contains orientation columns:
# _rlnAngleRot, _rlnAngleTilt, _rlnAnglePsi
grep "rlnAngle" /path/to/particles.star
  2. Test overfitting capability:

python -m cryopares_train \
    --symmetry C1 \
    --particles_star_fname data.star \
    --train_save_dir /tmp/overfit_test \
    --n_epochs 100 \
    --overfit_batches 10
  3. Check symmetry: Wrong symmetry can make training impossible.

  4. Increase learning rate:

--config train.learning_rate=5e-3
  5. Increase model capacity:

--config models.image2sphere.lmax=14

Validation loss higher than training loss

Normal: Small gap (< 20%) is expected and healthy.

Concerning: Large gap (> 50%) indicates overfitting.

Solutions for overfitting:

  1. Increase regularization:

--config train.weight_decay=1e-4
  2. Reduce model complexity:

--config models.image2sphere.lmax=10
  3. More training data: Use more particles in .star file

  4. Check for data leakage: Ensure train/val split is correct

Training crashes with “Killed”

Symptoms: Process killed with no error message.

Cause: System OOM (out of RAM, not GPU memory).

Solutions:

  1. Reduce workers:

--num_dataworkers 2
  2. Reduce batch size:

--batch_size 16
  3. Monitor memory:

watch -n 1 free -h

Checkpoints not saving

Symptoms: No .ckpt files in train_save_dir/version_0/half1/checkpoints/

Solutions:

  1. Check disk space:

df -h /path/to/train_save_dir
  2. Check write permissions:

ls -ld /path/to/train_save_dir
  3. Verify training completes at least one epoch: Check logs for validation step completion


Inference Issues

Predicted poses are random

Symptoms:

  • Angular error > 90°

  • Reconstruction looks like noise

Causes and solutions:

  1. Wrong checkpoint directory:

# Should point to version_0, not to half1 or half2
--checkpoint_dir /path/to/training/version_0
# NOT: /path/to/training/version_0/half1
  2. Model not trained: Check training logs to verify training completed

  3. Different molecule: Model trained on a different protein than the inference data

  4. Wrong symmetry: Verify symmetry matches training

Reconstruction is blurry

Causes and solutions:

  1. Insufficient local refinement:

--config projmatching.grid_distance_degs=10.0

Note: increasing this parameter makes running time and memory consumption grow substantially.

  2. Too strict confidence filtering:

--config inference.directional_zscore_thr=1.5
  3. Not enough particles passing filter: Check output .star file size

  4. Wrong reference map: Provide a better reference:

--reference_map /path/to/good_reference.mrc

“No particles passed confidence threshold”

Symptoms:

Warning: No particles passed directional_zscore_thr=2.0

Solutions:

  1. Lower threshold:

--config inference.directional_zscore_thr=1.0
  2. Disable filtering:

--config inference.directional_zscore_thr=None
  3. Check if model and data match: Verify you’re using the correct trained model
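
To choose a sensible threshold, it can also help to inspect the distribution of per-particle scores in the predicted .star file. The column name below is a placeholder; check the header of your output file for the actual label:

import starfile

star = starfile.read('/path/to/output/particles.star')
df = star['particles'] if isinstance(star, dict) else star
score_col = 'rlnDirectionalZscore'  # placeholder name; use the column actually present in your file
print(df[score_col].describe())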

Inference slower than expected

Solutions:

  1. Increase batch size:

--batch_size 64

Note that this also increases GPU memory consumption.

  2. Reduce top_k predictions: If you set inference.top_k_poses_nnet>1, decrease it:

--config inference.top_k_poses_nnet=1
  3. Skip reconstruction if not needed:

--config inference.skip_reconstruction=True
  4. Use GPU:

--config inference.use_cuda=True
  5. Compile model:

--compile_model

Data Issues

CTF parameters missing

Symptoms:

KeyError: '_rlnDefocusU'

Solution:

CryoPARES requires CTF parameters in .star file. Ensure your .star file was generated after CTF estimation (e.g., CTFFIND or Gctf in RELION).

Required CTF columns:

  • _rlnDefocusU

  • _rlnDefocusV

  • _rlnDefocusAngle

  • _rlnVoltage

  • _rlnSphericalAberration

  • _rlnAmplitudeContrast
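
A quick way to check that all required CTF labels are present somewhere in the file (optics or particles block), sketched with the starfile package (note that starfile strips the leading underscore from column names):

import starfile

blocks = starfile.read('/path/to/particles.star')
if not isinstance(blocks, dict):
    blocks = {'particles': blocks}
present = set().union(*(set(b.columns) for b in blocks.values() if hasattr(b, 'columns')))
required = ['rlnDefocusU', 'rlnDefocusV', 'rlnDefocusAngle',
            'rlnVoltage', 'rlnSphericalAberration', 'rlnAmplitudeContrast']
print('Missing CTF labels:', [c for c in required if c not in present] or 'none')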

Orientation columns missing

Symptoms:

KeyError: '_rlnAngleRot'

Solution:

For training, you need pre-aligned particles with:

  • _rlnAngleRot

  • _rlnAngleTilt

  • _rlnAnglePsi

  • _rlnOriginXAngst

  • _rlnOriginYAngst

For inference, orientations are optional (will be predicted).

Particles are poorly centered

Symptoms:

  • High angular errors

  • Poor reconstruction quality

Solution:

Re-center particles in RELION:

  1. Extract particles with larger box size

  2. Run 2D or 3D classification

  3. Re-extract centered particles

Different sampling rate in data vs. reference

Symptoms:

RuntimeError: Size mismatch between particles and reference

Solution:

CryoPARES handles rescaling automatically, but verify:

  1. Sampling rates are specified in .star file: Check for _rlnImagePixelSize or _rlnDetectorPixelSize

  2. Reference map has correct pixel size:

# Check with mrcfile
python -c "import mrcfile; print(mrcfile.open('ref.mrc').voxel_size)"
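
You can also compare the pixel size recorded in the .star optics block against the reference map in one go (a sketch assuming starfile and mrcfile; column names follow RELION 3.1+ conventions):

import mrcfile
import starfile

star = starfile.read('/path/to/particles.star')
optics = star.get('optics') if isinstance(star, dict) else None
if optics is not None and 'rlnImagePixelSize' in optics.columns:
    print('Particle pixel size (Å):', float(optics['rlnImagePixelSize'].iloc[0]))
with mrcfile.open('ref.mrc', permissive=True) as f:
    print('Reference voxel size (Å):', f.voxel_size.x)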

Performance Issues

Training is very slow

Diagnostic:

Check if GPU is being used:

import torch
print(torch.cuda.is_available())  # Should be True
print(torch.cuda.device_count())   # Should be > 0

Solutions:

  1. Enable model compilation:

--compile_model
  2. Increase batch size:

--batch_size 64
  3. Reduce image size:

--config datamanager.particlesDataset.image_size_px_for_nnet=96
  4. Use multiple GPUs: CryoPARES automatically uses all available GPUs

  5. Reduce model complexity:

--config models.image2sphere.lmax=12
  6. Check that data loading is not the bottleneck:

--num_dataworkers 8

You might want to use top/htop to monitor CPU usage as well as I/O. If you see your data workers mostly in the S (sleeping) state, that is a sign of an I/O bottleneck.

  7. Use faster precision:

--config float32_matmul_precision="medium"

Inference is very slow

Solutions:

  1. Increase batch size:

--batch_size 64
  2. Reduce angular search range:

--config projmatching.grid_distance_degs=4.0
  3. Skip local refinement and reconstruction (if acceptable):

--config inference.skip_localrefinement=True inference.skip_reconstruction=True
  4. Use coarser search:

--config projmatching.grid_step_degs=3.0

Data loading is bottleneck

Symptoms:

  • GPU utilization < 50%

  • High CPU usage from data workers

Solutions:

  1. Increase workers:

--num_dataworkers 8
  2. Use faster storage: Move data to local SSD instead of network drive

  3. Reduce preprocessing: Ensure images don’t require heavy rescaling
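
If you want numbers rather than eyeballing nvidia-smi, here is a minimal GPU-utilization probe (assuming the nvidia-ml-py / pynvml package is installed):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f'GPU util: {util.gpu}%  memory util: {util.memory}%')
    time.sleep(1)
pynvml.nvmlShutdown()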


CUDA/GPU Issues

“CUDA out of memory” but GPU seems empty

Cause: Memory fragmentation.

Solutions:

  1. Restart training:

# Clear GPU memory
nvidia-smi
# Kill any zombie processes
  2. Reduce batch size: Even if GPU shows free memory, fragmentation can prevent allocation

Multiple GPUs not being used

Diagnostic:

nvidia-smi
# Only GPU 0 shows activity

Solution:

CryoPARES uses PyTorch Lightning’s automatic multi-GPU training. To force specific GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m cryopares_train ...

GPU slower than expected

Solutions:

  1. Check GPU is not throttling:

nvidia-smi --query-gpu=temperature.gpu,clocks.current.graphics --format=csv
  2. Ensure you are not using integrated graphics:

nvidia-smi --query-gpu=name --format=csv
  3. Update NVIDIA drivers:

nvidia-smi
# Check driver version, update if old

Output Quality Issues

Reconstruction has artifacts

Causes and solutions:

  1. Overfitting to noise:

    • Use directional_zscore_thr to filter low-confidence particles

    • Use matching half-sets (default behavior)

  2. CTF correction issues:

    • Verify CTF parameters are correct

  3. Mask too tight:

--config datamanager.particlesDataset.mask_radius_angs=150  # Increase
  4. Insufficient particles: Check how many particles passed filtering

FSC is lower than expected

Solutions:

  1. More thorough local refinement:

--config projmatching.grid_distance_degs=10.0 \
        projmatching.grid_step_degs=1.0
  2. Filter particles more aggressively:

--config inference.directional_zscore_thr=2.5

Reconstructed map has wrong hand

Solution:

The model learns the hand from the training data. If the training data had the wrong hand, flip the reference map:

import mrcfile

with mrcfile.open('volume.mrc') as f:
    vol, voxel_size = f.data.copy(), f.voxel_size
vol_flipped = vol[:, :, ::-1].copy()  # mirroring any single axis inverts the hand
with mrcfile.new('volume_flipped.mrc', overwrite=True) as out:
    out.set_data(vol_flipped)
    out.voxel_size = voxel_size

Getting More Help

Enable Debug Mode

For more verbose output:

python -m cryopares_train \
    --show_debug_stats \
    ... other args ...

Check Logs

Training logs are saved to:

train_save_dir/version_0/half*/

Report Issues

If you encounter a bug:

  1. Check existing issues: https://github.com/rsanchezgarc/cryoPARES/issues

  2. Provide:

    • Full command used

    • Error message and stack trace

    • CryoPARES version: pip show cryopares

    • PyTorch version: python -c "import torch; print(torch.__version__)"

    • CUDA version: nvidia-smi

  3. Minimal reproducible example: Create smallest possible example that shows the bug


See Also