# Troubleshooting Guide This guide covers common issues and their solutions when using CryoPARES. ## Table of Contents - [Installation Issues](#installation-issues) - [Configuration Issues](#configuration-issues) - [File System Issues](#file-system-issues) - [Memory Issues](#memory-issues) - [Training Issues](#training-issues) - [Inference Issues](#inference-issues) - [Data Issues](#data-issues) - [Performance Issues](#performance-issues) - [CUDA/GPU Issues](#cudagpu-issues) - [Output Quality Issues](#output-quality-issues) --- ## Installation Issues ### pip install fails with dependency conflicts **Symptoms:** ``` ERROR: pip's dependency resolver does not currently take into account all the packages ``` **Solution:** 1. Create a fresh conda environment: ```bash conda create -n cryopares_fresh python=3.12 conda activate cryopares_fresh ``` 2. Install in order: ```bash pip install git+https://github.com/rsanchezgarc/cryoPARES.git ``` ### ImportError: No module named 'cryoPARES' **Symptoms:** ```python ImportError: No module named 'cryoPARES' ``` **Solutions:** 1. Verify installation: ```bash pip list | grep cryoPARES ``` 2. If using development install, check you're in the right directory: ```bash cd /path/to/cryoPARES pip install -e . ``` 3. Check Python environment: ```bash which python # Should point to your conda environment ``` ### CUDA version mismatch **Symptoms:** ``` RuntimeError: CUDA error: no kernel image is available for execution on the device ``` **Solution:** Reinstall PyTorch with correct CUDA version: ```bash # Check CUDA version nvidia-smi # Install matching PyTorch (example for CUDA 11.8) pip install --force-reinstall numpy scikit-image scipy torch torchvision --index-url https://download.pytorch.org/whl/cu118 pip install -e . ``` --- ## Configuration Issues ### Type mismatch errors when using --config **Symptoms:** ``` TypeError: argument must be int, not float ValueError: could not convert string to float ``` **Cause:** Parameter types must match exactly when using `--config` overrides. **Solution:** **Always match the parameter type:** 1. **For float parameters** - include a decimal point: ```bash # Correct: --config train.learning_rate=1e-3 --config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2.0 # Wrong: --config train.learning_rate=1 # Missing decimal - will fail! --config sampling_rate_angs_for_nnet=2 # Missing decimal - will fail! ``` 2. **For int parameters** - do NOT include a decimal point: ```bash # Correct: --config models.image2sphere.lmax=8 --config train.n_epochs=100 # Wrong: --config models.image2sphere.lmax=8.0 # Has decimal - will fail! --config train.n_epochs=100.0 # Has decimal - will fail! ``` 3. **Check parameter types:** ```bash python -m cryopares_train --show-config | grep parameter_name # Look for (int) or (float) annotation ``` --- ## File System Issues ### "Too many open files" error **Symptoms:** ``` OSError: [Errno 24] Too many open files ``` **Root cause:** CryoPARES opens one file handler for each `.mrcs` file in the .star file. **Solutions:** 1. **Immediate fix:** Increase file descriptor limit: ```bash ulimit -n 65536 ``` 2. **Permanent fix:** Add to `.bashrc` or `.bash_profile`: ```bash echo "ulimit -n 65536" >> ~/.bashrc source ~/.bashrc ``` If you cannot set it to a large enough number, join your particles stacsk into a smaller number of .mrcs files 3. **System-wide fix** (requires root): ```bash # Edit /etc/security/limits.conf sudo nano /etc/security/limits.conf # Add these lines: * soft nofile 65536 * hard nofile 65536 ``` 4. **Verify limit:** ```bash ulimit -n # Should show 65536 or higher ``` ### Permission denied when writing outputs **Symptoms:** ``` PermissionError: [Errno 13] Permission denied: '/path/to/output' ``` **Solutions:** 1. Check directory permissions: ```bash ls -ld /path/to/output ``` 2. Create directory with correct permissions: ```bash mkdir -p /path/to/output chmod 755 /path/to/output ``` 3. Use a different output directory: ```bash --train_save_dir ~/cryopares_outputs/ ``` ### Particle files not found **Symptoms:** ``` FileNotFoundError: Particle file not found: /path/to/particles/... ``` **Solutions:** 1. **Check `--particles_dir` argument:** ```bash # If .star file has relative paths like: # MotionCorr/job01/particles_001.mrcs #And the MotionCorr directory is at /path/to/relion/project/ # Use: --particles_dir /path/to/relion/project/ ``` 2. **Verify .star file paths:** ```bash head -20 /path/to/particles.star # Check the _rlnImageName column ``` 3. **Make paths absolute:** ```python import starfile df = starfile.read('/path/to/particles.star') df['rlnImageName'] = df['rlnImageName'].apply( lambda x: f'/absolute/path/{x}' ) starfile.write(df, 'particles_absolute.star') ``` --- ## Memory Issues ### Out of memory (OOM) during training **Symptoms:** ``` RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB ``` **Solutions:** 1. **Reduce batch size:** ```bash --batch_size 16 ``` You can compensate, up to a certain point, the batch reduction by increasing accumulate_gra_batches to keep the effective batch size (batch_size x accumulate_grad_batches) constant ```bash --config train.accumulate_grad_batches=32 # Maintains effective batch size while reducing memory ``` 2. **Reduce image size:** ```bash --config datamanager.particlesDataset.image_size_px_for_nnet=96 ``` and/or ```bash --config datamanager.particlesDataset.sampling_rate_angs_for_nnet=2.0 ``` Note that larger sampling rate implies smaller images. Don't try to reduce the image size at inference, it won't work, as the checkpoint is prepared for the training size 3. **Reduce model complexity:** There are several parameters that severely contribute to the model size. Decrease them to have a smaller network. ```bash --config models.image2sphere.lmax=10 #Default is 12 ``` ```bash --config models.image2sphere.so3components.i2sprojector.sphere_fdim=256 #Default is 512 ``` ```bash --config models.image2sphere.so3components.s2conv.f_out=32 #Default is 64 ``` ```bash --config models.image2sphere.imageencoder.out_channels=256 ``` ### Out of memory during inference **Solutions:** 1. **Reduce inference batch size:** ```bash --batch_size 16 ``` ### RAM exhausted when loading data **Symptoms:** ``` MemoryError: Unable to allocate array ``` **Solutions:** 1. **Disable in-memory caching:** ```bash --config datamanager.particlesDataset.store_data_in_memory=False ``` 2. **Reduce number of workers:** ```bash --num_dataworkers 2 ``` ## Training Issues ### Loss becomes NaN **Symptoms:** ``` loss: nan, geo_degs: nan ``` **Causes and solutions:** 1. **Learning rate too high:** ```bash --config train.learning_rate=1e-4 ``` 2. **Numerical instability:** ```bash --config train.weight_decay=1e-6 # Reduce regularization ``` 3. **Bad data:** Check for corrupted particles or extreme values in .star file ### Training doesn't improve **Symptoms:** - Loss plateaus immediately - `val_geo_degs` > 30° after many epochs **Diagnostic steps:** 1. **Verify data is pre-aligned:** ```bash # Check that .star file contains orientation columns: # _rlnAngleRot, _rlnAngleTilt, _rlnAnglePsi grep "rlnAngle" /path/to/particles.star ``` 2. **Test overfitting capability:** ```bash python -m cryopares_train \ --symmetry C1 \ --particles_star_fname data.star \ --train_save_dir /tmp/overfit_test \ --n_epochs 100 \ --overfit_batches 10 ``` 3. **Check symmetry:** Wrong symmetry can make training impossible. 4. **Increase learning rate:** ```bash --config train.learning_rate=5e-3 ``` 5. **Increase model capacity:** ```bash --config models.image2sphere.lmax=14 ``` ### Validation loss higher than training loss **Normal:** Small gap (< 20%) is expected and healthy. **Concerning:** Large gap (> 50%) indicates overfitting. **Solutions for overfitting:** 1. **Increase regularization:** ```bash --config train.weight_decay=1e-4 ``` 2. **Reduce model complexity:** ```bash --config models.image2sphere.lmax=10 ``` 3. **More training data:** Use more particles in .star file 4. **Check for data leakage:** Ensure train/val split is correct ### Training crashes with "Killed" **Symptoms:** Process killed with no error message. **Cause:** System OOM (out of RAM, not GPU memory). **Solutions:** 1. **Reduce workers:** ```bash --num_dataworkers 2 ``` 2. **Reduce batch size:** ```bash --batch_size 16 ``` 3. **Monitor memory:** ```bash watch -n 1 free -h ``` ### Checkpoints not saving **Symptoms:** No `.ckpt` files in `train_save_dir/version_0/half1/checkpoints/` **Solutions:** 1. **Check disk space:** ```bash df -h /path/to/train_save_dir ``` 2. **Check write permissions:** ```bash ls -ld /path/to/train_save_dir ``` 3. **Verify training completes at least one epoch:** Check logs for validation step completion --- ## Inference Issues ### Predicted poses are random **Symptoms:** - Angular error > 90° - Reconstruction looks like noise **Causes and solutions:** 1. **Wrong checkpoint directory:** ```bash # Should point to version_0, not to half1 or half2 --checkpoint_dir /path/to/training/version_0 # NOT: /path/to/training/version_0/half1 ``` 2. **Model not trained:** Check training logs to verify training completed 3. **Different molecule:** Model trained on different protein than inference data 4. **Wrong symmetry:** Verify symmetry matches training ### Reconstruction is blurry **Causes and solutions:** 1. **Insufficient local refinement:** ```bash --config projmatching.grid_distance_degs=10.0 ``` Note. Increasing this parameter will make running times and memory consumption massively grow. 2. **Too strict confidence filtering:** ```bash --config inference.directional_zscore_thr=1.5 ``` 3. **Not enough particles passing filter:** Check output .star file size 4. **Wrong reference map:** Provide better reference: ```bash --reference_map /path/to/good_reference.mrc ``` ### "No particles passed confidence threshold" **Symptoms:** ``` Warning: No particles passed directional_zscore_thr=2.0 ``` **Solutions:** 1. **Lower threshold:** ```bash --config inference.directional_zscore_thr=1.0 ``` 2. **Disable filtering:** ```bash --config inference.directional_zscore_thr=None ``` 3. **Check if model and data match:** Verify you're using the correct trained model ### Inference slower than expected **Solutions:** 1. **Increase batch size:** ```bash --batch_size 64 ``` But this increases GPU memory consumption 2. **Reduce top_k predictions:** If you enabled inference.top_k_poses_nnet>1, then decrease the number 3. ```bash --config inference.top_k_poses_nnet=1 ``` 3. **Skip reconstruction if not needed:** ```bash --config inference.skip_reconstruction=True ``` 4. **Use GPU:** ```bash --config inference.use_cuda=True ``` 5. **Compile model:** ```bash --compile_model ``` --- ## Data Issues ### CTF parameters missing **Symptoms:** ``` KeyError: '_rlnDefocusU' ``` **Solution:** CryoPARES requires CTF parameters in .star file. Ensure your .star file was generated after CTF estimation (e.g., CTFFIND or Gctf in RELION). Required CTF columns: - `_rlnDefocusU` - `_rlnDefocusV` - `_rlnDefocusAngle` - `_rlnVoltage` - `_rlnSphericalAberration` - `_rlnAmplitudeContrast` ### Orientation columns missing **Symptoms:** ``` KeyError: '_rlnAngleRot' ``` **Solution:** For **training**, you need pre-aligned particles with: - `_rlnAngleRot` - `_rlnAngleTilt` - `_rlnAnglePsi` - `_rlnOriginXAngst` - `_rlnOriginYAngst` For **inference**, orientations are optional (will be predicted). ### Particles are poorly centered **Symptoms:** - High angular errors - Poor reconstruction quality **Solution:** Re-center particles in RELION: 1. Extract particles with larger box size 2. Run 2D or 3D classification 3. Re-extract centered particles ### Different sampling rate in data vs. reference **Symptoms:** ``` RuntimeError: Size mismatch between particles and reference ``` **Solution:** CryoPARES handles rescaling automatically, but verify: 1. **Sampling rates are specified in .star file:** Check for `_rlnImagePixelSize` or `_rlnDetectorPixelSize` 2. **Reference map has correct pixel size:** ```bash # Check with mrcfile python -c "import mrcfile; print(mrcfile.open('ref.mrc').voxel_size)" ``` --- ## Performance Issues ### Training is very slow **Diagnostic:** Check if GPU is being used: ```python import torch print(torch.cuda.is_available()) # Should be True print(torch.cuda.device_count()) # Should be > 0 ``` **Solutions:** 1. **Enable model compilation:** ```bash --compile_model ``` 2. **Increase batch size:** ```bash --batch_size 64 ``` 3. **Reduce image size:** ```bash --config datamanager.particlesDataset.image_size_px_for_nnet=96 ``` 4. **Use multiple GPUs:** CryoPARES automatically uses all available GPUs 5. **Reduce model complexity:** ```bash --config models.image2sphere.lmax=12 ``` 6. **Check data loading is not bottleneck:** ```bash --num_dataworkers 8 ``` You might want to use `top`/`htop` to monitor CPU usage as well as IO. If you see your workers in S status, it is a sign of IO bottleneck. 7. **Use faster precision:** ```bash --config float32_matmul_precision="medium" ``` ### Inference is very slow **Solutions:** 1. **Increase batch size:** ```bash --batch_size 64 ``` 2. **Reduce angular search range:** ```bash --config projmatching.grid_distance_degs=4.0 ``` 3. **Skip local refinement (if acceptable): and reconstruction** ```bash --config inference.skip_localrefinement=True inference.skip_reconstruction=True ``` 4. **Use coarser search:** ```bash --config projmatching.grid_step_degs=3.0 ``` ### Data loading is bottleneck **Symptoms:** - GPU utilization < 50% - High CPU usage from data workers **Solutions:** 1. **Increase workers:** ```bash --num_dataworkers 8 ``` 2. **Use faster storage:** Move data to local SSD instead of network drive 3. **Reduce preprocessing:** Ensure images don't require heavy rescaling --- ## CUDA/GPU Issues ### "CUDA out of memory" but GPU seems empty **Cause:** Memory fragmentation. **Solutions:** 1. **Restart training:** ```bash # Clear GPU memory nvidia-smi # Kill any zombie processes ``` 2. **Reduce batch size:** Even if GPU shows free memory, fragmentation can prevent allocation ### Multiple GPUs not being used **Diagnostic:** ```bash nvidia-smi # Only GPU 0 shows activity ``` **Solution:** CryoPARES uses PyTorch Lightning's automatic multi-GPU training. To force specific GPUs: ```bash CUDA_VISIBLE_DEVICES=0,1,2,3 python -m cryopares_train ... ``` ### GPU slower than expected **Solutions:** 1. **Check GPU is not throttling:** ```bash nvidia-smi --query-gpu=temperature.gpu,clocks.current.graphics --format=csv ``` 2. **Ensure not using integrated graphics:** ```bash nvidia-smi --query-gpu=name --format=csv ``` 3. **Update NVIDIA drivers:** ```bash nvidia-smi # Check driver version, update if old ``` --- ## Output Quality Issues ### Reconstruction has artifacts **Causes and solutions:** 1. **Overfitting to noise:** - Use `directional_zscore_thr` to filter low-confidence particles - Use matching half-sets (default behavior) 2. **CTF correction issues:** - Verify CTF parameters are correct 3. **Mask too tight:** ```bash --config datamanager.particlesDataset.mask_radius_angs=150 # Increase ``` 4. **Insufficient particles:** Check how many particles passed filtering ### FSC is lower than expected **Solutions:** 1. **More thorough local refinement:** ```bash --config projmatching.grid_distance_degs=10.0 \ projmatching.grid_step_degs=1.0 ``` 2. **Filter particles more aggressively:** ```bash --config inference.directional_zscore_thr=2.5 ``` ### Reconstructed map has wrong hand **Solution:** The model learns the hand from training data. If training data had wrong hand, flip reference map: ```python import mrcfile vol = mrcfile.open('volume.mrc').data vol_flipped = vol[:, :, ::-1] # Flip along z mrcfile.write('volume_flipped.mrc', vol_flipped) ``` --- ## Getting More Help ### Enable Debug Mode For more verbose output: ```bash python -m cryopares_train \ --show_debug_stats \ ... other args ... ``` ### Check Logs Training logs are saved to: ``` train_save_dir/version_0/half*/ ``` ### Report Issues If you encounter a bug: 1. **Check existing issues:** https://github.com/rsanchezgarc/cryoPARES/issues 2. **Provide:** - Full command used - Error message and stack trace - CryoPARES version: `pip show cryopares` - PyTorch version: `python -c "import torch; print(torch.__version__)"` - CUDA version: `nvidia-smi` 3. **Minimal reproducible example:** Create smallest possible example that shows the bug --- ## See Also - [Training Guide](training_guide.md) - Training best practices - [Configuration Guide](configuration_guide.md) - All configuration parameters - [API Reference](https://rsanchezgarc.github.io/cryoPARES/api/) - Detailed API documentation