# Debugging
Debugging distributed systems is notoriously difficult. PX makes it easier by encouraging patterns that are naturally debuggable: single-process, single-threaded programs that operate on partitioned data.
## Start Local, Scale Remote
The best way to debug PX jobs is to test locally first:
```bash
# Test your script locally first
python process.py < sample_input.txt

# Run locally in parallel
px run -p 4 'python process.py'

# Then scale to the cloud
px run --cluster files -p 16 -a images.txt 'python process.py'
```

If your code works locally on a single input, it should work remotely on partitioned inputs, assuming your code is idempotent and doesn't rely on shared state.
## Common Debugging Patterns
### 1. Test with Small Data First
Don't start by processing 60,000 files. Start with 10:
```bash
# Create a small test dataset
mkdir test_data
cp sample_files/*.jpg test_data/

# Test locally
px run -p 2 'python process.py'
```

### 2. Check Your Input/Output Assumptions
Make sure your code handles the following (a defensive sketch appears after the list):
- Empty inputs gracefully
- File permissions correctly
- Output directories that may not exist
- Partial or malformed data
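As a minimal illustration of these checks, something like the sketch below works; the `transform()` step and the file layout are hypothetical, not part of PX:

```python
import sys
from pathlib import Path

def process(input_path: str, output_dir: str) -> None:
    """Defensively process one file; transform() is a hypothetical step."""
    src = Path(input_path)

    # Missing input: fail loudly with a clear message.
    if not src.exists():
        sys.exit(f"input not found: {src}")

    # File permissions: report unreadable files instead of crashing.
    try:
        data = src.read_bytes()
    except PermissionError:
        print(f"cannot read {src}: permission denied", file=sys.stderr)
        return

    # Empty input: skip cleanly rather than crash downstream.
    if not data:
        print(f"skipping empty file: {src}", file=sys.stderr)
        return

    # The output directory may not exist yet on a fresh worker.
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Partial or malformed data: report it and move on.
    try:
        result = transform(data)  # hypothetical processing step
    except ValueError as exc:
        print(f"malformed input {src}: {exc}", file=sys.stderr)
        return

    (out_dir / src.name).write_bytes(result)
```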
### 3. Use Verbose Logging
Add logging to understand what each process is doing:
```python
import logging
import os

logging.basicConfig(
    level=logging.INFO,
    format=f'[{os.getpid()}] %(message)s'
)
logger = logging.getLogger(__name__)

logger.info(f"Processing file: {filename}")
```

When running in parallel, the PID helps you distinguish which process is doing what.
## Idempotency Checklist
Your code should be idempotent. Ask yourself:
- ✅ Does my code produce the same output given the same input?
- ✅ Can I re-run this code on the same data without problems?
- ✅ Does my code avoid external state (databases, APIs, global counters)?
- ✅ Are my file writes atomic or to unique output files? (one atomic-write pattern is sketched below)
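For the last point, a common pattern is to write to a temporary file and rename it into place. A minimal sketch, not PX-specific; `os.replace` is atomic on POSIX filesystems:

```python
import os
import tempfile

def write_atomically(data: bytes, dest_path: str) -> None:
    """Write to a temp file in the destination directory, then rename.

    Because the rename is atomic, readers and re-runs either see the
    old complete file or the new complete one, never a partial write.
    """
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```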
## Understanding Process Isolation
Each parallel instance of your code runs in its own Linux process. These processes:
- Have separate memory spaces
- Don't share variables or state
- Process a separate partition of the input data
- Write to their own output locations
Think of each process as a completely independent script execution.
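One practical consequence: module-level state in your script is per-process, so it cannot aggregate across partitions. A sketch, with a hypothetical function name:

```python
import os

# Each parallel instance imports this module independently, so this
# counter is per-process state: it never sums across partitions.
processed_count = 0

def process_line(line: str) -> None:
    global processed_count
    processed_count += 1
    # Including the PID makes it visible that each process keeps its own count.
    print(f"[{os.getpid()}] processed {processed_count} lines so far")
```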
## When Things Go Wrong
### Job Failures
If a job fails, check the following (a sketch of a script that reports failures cleanly follows this list):
- Exit codes: Did your script exit with a non-zero status?
- Logs: What was printed to stderr?
- Input partitioning: Did some partitions have malformed data?
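To make the first two checks useful, have your script print diagnostics to stderr and exit non-zero on failure. A minimal sketch, where `run()` stands in for your actual processing entry point:

```python
import sys

def main() -> int:
    try:
        run()  # hypothetical: your actual processing entry point
    except Exception as exc:
        # Diagnostics go to stderr; a non-zero exit marks this partition as failed.
        print(f"fatal: {exc}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```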
### Partial Results
If you get partial results:
- Check idempotency: Can you safely re-run on all inputs?
- Check output handling: Are you overwriting vs. appending? (see the sketch below)
- Check file locking: Are concurrent writes causing issues?
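One way to sidestep the last two issues is to give every input its own output file, so no two processes ever touch the same path. A sketch, where the `results/` layout is a hypothetical example:

```python
from pathlib import Path

OUTPUT_DIR = Path("results")  # hypothetical output layout

def write_result(input_path: str, payload: str) -> None:
    # One output file per input: concurrent processes never share a
    # file, and a re-run simply overwrites with the same content.
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    out = OUTPUT_DIR / (Path(input_path).stem + ".json")
    out.write_text(payload)
```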
### Performance Issues
If jobs are slow:
- Profile locally first: Use standard profiling tools (see the cProfile sketch below)
- Check I/O patterns: Are you reading the same file repeatedly?
- Adjust parallelism: Try `-p` with different values
- Monitor resource usage: Are you CPU-bound or I/O-bound?
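For local profiling, Python's built-in `cProfile` is usually enough to find hot spots before scaling out. A sketch, assuming `process_file` is a function in your own `process.py` (and that `process.py` guards its entry point with `if __name__ == "__main__"`):

```python
import cProfile
import pstats

from process import process_file  # hypothetical import of your script

# Profile one representative input locally before scaling out.
cProfile.run("process_file('sample.jpg')", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)  # ten slowest call paths
```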
## Best Practices
### Write Debuggable Code
```python
# Good: Clear, single-purpose, debuggable
def process_image(input_path, output_path):
    """Process a single image file."""
    img = load_image(input_path)
    processed = apply_filter(img)
    save_image(processed, output_path)
    return output_path

# Avoid: Complex, stateful, hard to debug
class ImageProcessor:
    def __init__(self):
        self.cache = {}
        self.counter = 0
        self.db_conn = connect_to_db()
    # ... distributed state is hard to debug
```

### Keep It Simple
The simpler your code, the easier it is to debug at scale:
- Prefer pure functions over stateful classes
- Avoid global variables and shared state
- Use explicit inputs and outputs
- Log important operations
### Test Edge Cases
Test your code with:
- Empty inputs
- Very large inputs
- Malformed data
- Missing files
- Permission issues
If your code handles these locally, it will handle them at scale.
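For example, here is a quick pytest-style check for the empty-input case, assuming `process.py` reads from stdin as in the earlier examples:

```python
import subprocess
import sys

def test_empty_input():
    # An empty stdin should produce a clean exit, not a traceback.
    result = subprocess.run(
        [sys.executable, "process.py"],
        input=b"",
        capture_output=True,
    )
    assert result.returncode == 0, result.stderr.decode()
```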