Troubleshooting Guide
This guide covers common issues and their solutions when using CryoPARES.
Table of Contents
Installation Issues
Configuration Issues
File System Issues
Memory Issues
Training Issues
Inference Issues
Data Issues
Performance Issues
CUDA/GPU Issues
Output Quality Issues
Getting More Help
See Also
Installation Issues
pip install fails with dependency conflicts
Symptoms:
ERROR: pip's dependency resolver does not currently take into account all the packages
Solution:
Create a fresh conda environment:
conda create -n cryopares_fresh python=3.12
conda activate cryopares_fresh
Install CryoPARES in the fresh environment:
pip install git+https://github.com/rsanchezgarc/cryoPARES.git
ImportError: No module named 'cryoPARES'
Symptoms:
ImportError: No module named 'cryoPARES'
Solutions:
Verify installation:
pip list | grep cryoPARES
If using development install, check you’re in the right directory:
cd /path/to/cryoPARES
pip install -e .
Check Python environment:
which python
# Should point to your conda environment
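You can also check, from inside Python, which interpreter and which cryoPARES installation actually get picked up (a minimal diagnostic using only the standard library and the cryoPARES import name shown in the error above):
import sys
print(sys.executable)      # should point inside your conda environment
import cryoPARES
print(cryoPARES.__file__)  # shows which installation is actually imported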
CUDA version mismatch
Symptoms:
RuntimeError: CUDA error: no kernel image is available for execution on the device
Solution:
Reinstall PyTorch with correct CUDA version:
# Check CUDA version
nvidia-smi
# Install matching PyTorch (example for CUDA 11.8)
pip install --force-reinstall numpy scikit-image scipy torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e .
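After reinstalling, you can verify that the PyTorch build matches a CUDA runtime your driver supports (a quick sanity check using standard PyTorch calls):
import torch
print(torch.__version__)          # e.g. 2.x.x+cu118
print(torch.version.cuda)         # CUDA runtime this build was compiled against
print(torch.cuda.is_available())  # should be True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))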
Configuration Issues
Type mismatch errors when using --config
Symptoms:
TypeError: argument must be int, not float
ValueError: could not convert string to float
Cause: Parameter types must match exactly when using --config overrides.
Solution:
Always match the parameter type:
For float parameters - include a decimal point:
# Correct:
--config train.learning_rate=1e-3
--config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2.0
# Wrong:
--config train.learning_rate=1  # Missing decimal - will fail!
--config sampling_rate_angs_for_nnet=2  # Missing decimal - will fail!
For int parameters - do NOT include a decimal point:
# Correct:
--config models.image2sphere.lmax=8
--config train.n_epochs=100
# Wrong:
--config models.image2sphere.lmax=8.0  # Has decimal - will fail!
--config train.n_epochs=100.0  # Has decimal - will fail!
Check parameter types:
python -m cryopares_train --show-config | grep parameter_name
# Look for (int) or (float) annotation
File System Issues
“Too many open files” error
Symptoms:
OSError: [Errno 24] Too many open files
Root cause: CryoPARES opens one file handle for each .mrcs file referenced in the .star file.
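To estimate how many handles you actually need, you can count the distinct .mrcs stacks referenced by your .star file. This is an illustrative sketch that assumes the starfile package and the standard index@path format of _rlnImageName:
import starfile
star = starfile.read('/path/to/particles.star')
df = star['particles'] if isinstance(star, dict) else star
# _rlnImageName entries look like "000001@MotionCorr/job01/particles_001.mrcs"
stacks = {name.split('@', 1)[1] for name in df['rlnImageName']}
print(f'{len(stacks)} distinct .mrcs stacks -> at least that many open file handles')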
Solutions:
Immediate fix: Increase file descriptor limit:
ulimit -n 65536
Permanent fix: Add to .bashrc or .bash_profile:
echo "ulimit -n 65536" >> ~/.bashrc
source ~/.bashrc
If you cannot raise the limit far enough, join your particle stacks into a smaller number of .mrcs files.
System-wide fix (requires root):
# Edit /etc/security/limits.conf
sudo nano /etc/security/limits.conf
# Add these lines:
* soft nofile 65536
* hard nofile 65536
Verify limit:
ulimit -n
# Should show 65536 or higher
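The limit can also be checked from inside Python, which is what the data-loading workers will inherit (standard library only):
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'soft={soft} hard={hard}')  # the soft limit is what triggers Errno 24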
Permission denied when writing outputs
Symptoms:
PermissionError: [Errno 13] Permission denied: '/path/to/output'
Solutions:
Check directory permissions:
ls -ld /path/to/output
Create directory with correct permissions:
mkdir -p /path/to/output
chmod 755 /path/to/output
Use a different output directory:
--train_save_dir ~/cryopares_outputs/
Particle files not found
Symptoms:
FileNotFoundError: Particle file not found: /path/to/particles/...
Solutions:
Check the --particles_dir argument:
# If .star file has relative paths like:
# MotionCorr/job01/particles_001.mrcs
# And the MotionCorr directory is at /path/to/relion/project/
# Use:
--particles_dir /path/to/relion/project/
Verify .star file paths:
head -20 /path/to/particles.star
# Check the _rlnImageName column
Make paths absolute:
import starfile
df = starfile.read('/path/to/particles.star')
def make_absolute(name, root='/absolute/path'):
    # _rlnImageName looks like "000001@MotionCorr/job01/particles_001.mrcs";
    # prepend the project root to the path part only, keeping the slice index
    idx, path = name.split('@', 1)
    return f'{idx}@{root}/{path}'
df['rlnImageName'] = df['rlnImageName'].apply(make_absolute)
starfile.write(df, 'particles_absolute.star')
Memory Issues
Out of memory (OOM) during training
Symptoms:
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
Solutions:
Reduce batch size:
--batch_size 16
You can compensate for the smaller batch, up to a point, by increasing accumulate_grad_batches so that the effective batch size (batch_size x accumulate_grad_batches) stays constant; for example, --batch_size 16 with accumulate_grad_batches=32 gives an effective batch size of 512.
--config train.accumulate_grad_batches=32
# Maintains effective batch size while reducing memory
Reduce image size:
--config datamanager.particlesDataset.image_size_px_for_nnet=96
and/or
--config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2.0
Note that a larger sampling rate implies smaller images (e.g. doubling the sampling rate halves the number of pixels per side for the same physical box). Do not try to reduce the image size at inference; it will not work, because the checkpoint is built for the training image size.
Reduce model complexity:
Several parameters contribute heavily to the model size; decrease them to get a smaller network.
--config models.image2sphere.lmax=10 #Default is 12
--config models.image2sphere.so3components.i2sprojector.sphere_fdim=256 #Default is 512
--config models.image2sphere.so3components.s2conv.f_out=32 #Default is 64
--config models.image2sphere.imageencoder.out_channels=256
Out of memory during inference
Solutions:
Reduce inference batch size:
--batch_size 16
RAM exhausted when loading data
Symptoms:
MemoryError: Unable to allocate array
Solutions:
Disable in-memory caching:
--config datamanager.particlesDataset.store_data_in_memory=False
Reduce number of workers:
--num_dataworkers 2
Training Issues
Loss becomes NaN
Symptoms:
loss: nan, geo_degs: nan
Causes and solutions:
Learning rate too high:
--config train.learning_rate=1e-4
Numerical instability:
--config train.weight_decay=1e-6 # Reduce regularization
Bad data: Check for corrupted particles or extreme values in .star file
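A quick way to spot bad metadata is to look for NaNs or implausible defocus values (an illustrative sketch with the starfile package; adjust the thresholds to your dataset):
import starfile
star = starfile.read('/path/to/particles.star')
df = star['particles'] if isinstance(star, dict) else star
print(df[['rlnDefocusU', 'rlnDefocusV']].describe())  # NaNs or wild ranges are a red flag
bad = df[(df['rlnDefocusU'] <= 0) | (df['rlnDefocusU'] > 60000)]  # defocus in Angstroms; ~6 um upper bound
print(f'{len(bad)} particles with suspicious defocus values')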
Training doesn’t improve
Symptoms:
Loss plateaus immediately
val_geo_degs > 30° after many epochs
Diagnostic steps:
Verify data is pre-aligned:
# Check that .star file contains orientation columns:
# _rlnAngleRot, _rlnAngleTilt, _rlnAnglePsi
grep "rlnAngle" /path/to/particles.star
Test overfitting capability:
python -m cryopares_train \
--symmetry C1 \
--particles_star_fname data.star \
--train_save_dir /tmp/overfit_test \
--n_epochs 100 \
--overfit_batches 10
Check symmetry: Wrong symmetry can make training impossible.
Increase learning rate:
--config train.learning_rate=5e-3
Increase model capacity:
--config models.image2sphere.lmax=14
Validation loss higher than training loss
Normal: Small gap (< 20%) is expected and healthy.
Concerning: Large gap (> 50%) indicates overfitting.
Solutions for overfitting:
Increase regularization:
--config train.weight_decay=1e-4
Reduce model complexity:
--config models.image2sphere.lmax=10
More training data: Use more particles in .star file
Check for data leakage: Ensure train/val split is correct
Training crashes with “Killed”
Symptoms: Process killed with no error message.
Cause: System OOM (out of RAM, not GPU memory).
Solutions:
Reduce workers:
--num_dataworkers 2
Reduce batch size:
--batch_size 16
Monitor memory:
watch -n 1 free -h
Checkpoints not saving
Symptoms:
No .ckpt files in train_save_dir/version_0/half1/checkpoints/
Solutions:
Check disk space:
df -h /path/to/train_save_dir
Check write permissions:
ls -ld /path/to/train_save_dir
Verify training completes at least one epoch: Check logs for validation step completion
Inference Issues
Predicted poses are random
Symptoms:
Angular error > 90°
Reconstruction looks like noise
Causes and solutions:
Wrong checkpoint directory:
# Should point to version_0, not to half1 or half2
--checkpoint_dir /path/to/training/version_0
# NOT: /path/to/training/version_0/half1
Model not trained: Check training logs to verify training completed
Different molecule: Model trained on different protein than inference data
Wrong symmetry: Verify symmetry matches training
Reconstruction is blurry
Causes and solutions:
Insufficient local refinement:
--config projmatching.grid_distance_degs=10.0
Note: increasing this parameter makes running times and memory consumption grow substantially.
Too strict confidence filtering:
--config inference.directional_zscore_thr=1.5
Not enough particles passing filter: Check output .star file size
Wrong reference map: Provide better reference:
--reference_map /path/to/good_reference.mrc
“No particles passed confidence threshold”
Symptoms:
Warning: No particles passed directional_zscore_thr=2.0
Solutions:
Lower threshold:
--config inference.directional_zscore_thr=1.0
Disable filtering:
--config inference.directional_zscore_thr=None
Check if model and data match: Verify you’re using the correct trained model
Inference slower than expected
Solutions:
Increase batch size:
--batch_size 64
Note that this also increases GPU memory consumption.
Reduce top_k predictions: if you set inference.top_k_poses_nnet > 1, decrease it:
--config inference.top_k_poses_nnet=1
Skip reconstruction if not needed:
--config inference.skip_reconstruction=True
Use GPU:
--config inference.use_cuda=True
Compile model:
--compile_model
Data Issues
CTF parameters missing
Symptoms:
KeyError: '_rlnDefocusU'
Solution:
CryoPARES requires CTF parameters in .star file. Ensure your .star file was generated after CTF estimation (e.g., CTFFIND or Gctf in RELION).
Required CTF columns:
_rlnDefocusU
_rlnDefocusV
_rlnDefocusAngle
_rlnVoltage
_rlnSphericalAberration
_rlnAmplitudeContrast
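A short check that all of these columns are present (a sketch using the starfile package; in RELION 3.1+ files the voltage, spherical aberration and amplitude contrast usually live in the optics block, so all blocks are scanned):
import starfile
star = starfile.read('/path/to/particles.star')
blocks = list(star.values()) if isinstance(star, dict) else [star]
present = set()
for block in blocks:
    present |= set(getattr(block, 'columns', []))
required = {'rlnDefocusU', 'rlnDefocusV', 'rlnDefocusAngle',
            'rlnVoltage', 'rlnSphericalAberration', 'rlnAmplitudeContrast'}
print('missing CTF columns:', sorted(required - present) or 'none')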
Orientation columns missing
Symptoms:
KeyError: '_rlnAngleRot'
Solution:
For training, you need pre-aligned particles with:
_rlnAngleRot
_rlnAngleTilt
_rlnAnglePsi
_rlnOriginXAngst
_rlnOriginYAngst
For inference, orientations are optional (will be predicted).
Particles are poorly centered
Symptoms:
High angular errors
Poor reconstruction quality
Solution:
Re-center particles in RELION:
Extract particles with larger box size
Run 2D or 3D classification
Re-extract centered particles
Different sampling rate in data vs. reference
Symptoms:
RuntimeError: Size mismatch between particles and reference
Solution:
CryoPARES handles rescaling automatically, but verify:
Sampling rates are specified in the .star file: check for _rlnImagePixelSize or _rlnDetectorPixelSize
Reference map has correct pixel size:
# Check with mrcfile
python -c "import mrcfile; print(mrcfile.open('ref.mrc').voxel_size)"
Performance Issues
Training is very slow
Diagnostic:
Check if GPU is being used:
import torch
print(torch.cuda.is_available()) # Should be True
print(torch.cuda.device_count()) # Should be > 0
Solutions:
Enable model compilation:
--compile_model
Increase batch size:
--batch_size 64
Reduce image size:
--config datamanager.particlesDataset.image_size_px_for_nnet=96
Use multiple GPUs: CryoPARES automatically uses all available GPUs
Reduce model complexity:
--config models.image2sphere.lmax=10 # Default is 12
Check that data loading is not the bottleneck:
--num_dataworkers 8
You might want to use top/htop to monitor CPU usage as well as I/O. If your data workers spend most of their time in the S (sleeping) or D (waiting on I/O) state rather than R (running), that is a sign of an I/O bottleneck.
Use faster precision:
--config float32_matmul_precision="medium"
Inference is very slow
Solutions:
Increase batch size:
--batch_size 64
Reduce angular search range:
--config projmatching.grid_distance_degs=4.0
Skip local refinement and reconstruction (if acceptable):
--config inference.skip_localrefinement=True inference.skip_reconstruction=True
Use coarser search:
--config projmatching.grid_step_degs=3.0
Data loading is bottleneck
Symptoms:
GPU utilization < 50%
High CPU usage from data workers
Solutions:
Increase workers:
--num_dataworkers 8
Use faster storage: Move data to local SSD instead of network drive
Reduce preprocessing: Ensure images don’t require heavy rescaling
CUDA/GPU Issues
“CUDA out of memory” but GPU seems empty
Cause: Memory fragmentation.
Solutions:
Restart training:
# Check which processes are holding GPU memory
nvidia-smi
# Kill any zombie processes, then relaunch training
Reduce batch size: Even if GPU shows free memory, fragmentation can prevent allocation
Multiple GPUs not being used
Diagnostic:
nvidia-smi
# Only GPU 0 shows activity
Solution:
CryoPARES uses PyTorch Lightning’s automatic multi-GPU training. To force specific GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m cryopares_train ...
GPU slower than expected
Solutions:
Check GPU is not throttling:
nvidia-smi --query-gpu=temperature.gpu,clocks.current.graphics --format=csv
Ensure not using integrated graphics:
nvidia-smi --query-gpu=name --format=csv
Update NVIDIA drivers:
nvidia-smi
# Check driver version, update if old
Output Quality Issues
Reconstruction has artifacts
Causes and solutions:
Overfitting to noise:
Use directional_zscore_thr to filter low-confidence particles
Use matching half-sets (default behavior)
CTF correction issues:
Verify CTF parameters are correct
Mask too tight:
--config datamanager.particlesDataset.mask_radius_angs=150 # Increase
Insufficient particles: Check how many particles passed filtering
FSC is lower than expected
Solutions:
More thorough local refinement:
--config projmatching.grid_distance_degs=10.0 \
projmatching.grid_step_degs=1.0
Filter particles more aggressively:
--config inference.directional_zscore_thr=2.5
Reconstructed map has wrong hand
Solution:
The model learns the hand from the training data. If the training data had the wrong hand, flip the reference map:
import mrcfile
with mrcfile.open('volume.mrc') as mrc:
    vol = mrc.data.copy()
# data axes are (z, y, x); flipping along z inverts the hand
vol_flipped = vol[::-1].copy()
mrcfile.write('volume_flipped.mrc', vol_flipped)
Getting More Help
Enable Debug Mode
For more verbose output:
python -m cryopares_train \
--show_debug_stats \
... other args ...
Check Logs
Training logs are saved to:
train_save_dir/version_0/half*/
Report Issues
If you encounter a bug:
Check existing issues: https://github.com/rsanchezgarc/cryoPARES/issues
Provide:
Full command used
Error message and stack trace
CryoPARES version:
pip show cryopares
PyTorch version:
python -c "import torch; print(torch.__version__)"
CUDA version:
nvidia-smi
Minimal reproducible example: Create smallest possible example that shows the bug
See Also
Training Guide - Training best practices
Configuration Guide - All configuration parameters
API Reference - Detailed API documentation