Debugging

Debugging distributed systems is notoriously difficult. PX makes it easier by encouraging patterns that are naturally debuggable: single-process, single-threaded programs that operate on partitioned data.

Start Local, Scale Remote

The best way to debug PX jobs is to test locally first:

bash
# Test your script locally first
python process.py < sample_input.txt

# Run locally in parallel
px run -p 4 'python process.py'

# Then scale to the cloud
px run --cluster files -p 16 -a images.txt 'python process.py'

If your code works locally on a single input, it should work remotely on partitioned inputs — assuming your code is idempotent and doesn't rely on shared state.
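
In practice this means your script can be an ordinary filter: it reads one input per line from stdin and handles each one independently. Below is a minimal sketch, assuming PX feeds each process its partition of inputs as file paths on stdin; the process_file helper is a placeholder for your own logic, not part of PX:

python
import sys
from pathlib import Path

def process_file(path):
    """Placeholder: replace with your own per-file logic."""
    return Path(path).stat().st_size

def main():
    # Each parallel process receives only its partition of the input list.
    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        print(f"{path}\t{process_file(path)}")

if __name__ == "__main__":
    main()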

Common Debugging Patterns

1. Test with Small Data First

Don't start by processing 60,000 files. Start with 10:

bash
# Create a small test dataset
mkdir test_data
cp sample_files/*.jpg test_data/

# Test locally
px run -p 2 'python process.py'

2. Check Your Input/Output Assumptions

Make sure your code handles the following (a defensive sketch is given after the list):

  • Empty inputs gracefully
  • File permissions correctly
  • Output directories that may not exist
  • Partial or malformed data
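
A defensive sketch covering these cases; the helper name process_one and the assumption that inputs are JSON files are illustrative, not PX requirements:

python
import json
import sys
from pathlib import Path

def process_one(input_path, output_dir):
    """Handle missing, unreadable, empty, or malformed inputs without crashing the run."""
    src = Path(input_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # output directory may not exist yet

    if not src.exists():
        print(f"skipping missing file: {src}", file=sys.stderr)
        return
    try:
        text = src.read_text()
    except PermissionError:
        print(f"skipping unreadable file: {src}", file=sys.stderr)
        return
    if not text.strip():
        print(f"skipping empty file: {src}", file=sys.stderr)
        return
    try:
        record = json.loads(text)  # illustrative: inputs assumed to be JSON
    except json.JSONDecodeError:
        print(f"skipping malformed file: {src}", file=sys.stderr)
        return

    (out_dir / (src.stem + ".out.json")).write_text(json.dumps(record))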

3. Use Verbose Logging

Add logging to understand what each process is doing:

python
import logging
import os

logging.basicConfig(
    level=logging.INFO,
    format=f'[{os.getpid()}] %(message)s'
)

logger = logging.getLogger(__name__)
logger.info(f"Processing file: {filename}")

When running in parallel, the PID helps you distinguish which process is doing what.

Idempotency Checklist

Your code should be idempotent. Ask yourself:

  • ✅ Does my code produce the same output given the same input?
  • ✅ Can I re-run this code on the same data without problems?
  • ✅ Does my code avoid external state (databases, APIs, global counters)?
  • ✅ Are my file writes atomic or to unique output files? (see the sketch below)
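
One common pattern that satisfies the last two points is to write each result to a temporary file in the destination directory and then rename it into place; on a single filesystem the rename is atomic, so a re-run can never leave a half-written output behind. This is a sketch, and write_atomically is a name chosen here for illustration:

python
import os
import tempfile
from pathlib import Path

def write_atomically(output_path, data: str):
    """Write to a temp file next to the target, then atomically replace the target."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the destination directory so os.replace stays on one filesystem.
    with tempfile.NamedTemporaryFile("w", dir=output_path.parent, delete=False) as tmp:
        tmp.write(data)
        tmp_name = tmp.name
    os.replace(tmp_name, output_path)  # atomic on POSIX; safe to re-run

# Usage: give each input its own output file so re-runs never corrupt results.
# write_atomically(Path("out") / "image_0001.json", '{"status": "ok"}')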

Understanding Process Isolation

Each parallel instance of your code runs in its own Linux process. These processes:

  • Have separate memory spaces
  • Don't share variables or state
  • Process a partition of the input data
  • Write to their own output locations

Think of each process as a completely independent script execution.
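
You can see this isolation with nothing but the standard library: a module-level counter incremented inside worker processes never changes in the parent, because every process has its own copy of memory. The multiprocessing module is used here only to demonstrate the point; PX runs each parallel instance as its own Linux process, so the same rule applies:

python
import multiprocessing as mp

counter = 0  # module-level "shared" state: it is not actually shared

def work(n):
    global counter
    counter += n      # only modifies this worker process's copy
    return counter

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        print(pool.map(work, [1, 1, 1, 1]))  # each worker sees only its own count
    print(counter)  # still 0 in the parent: nothing was shared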

When Things Go Wrong

Job Failures

If a job fails, check the following (see the error-handling sketch after this list):

  1. Exit codes: Did your script exit with a non-zero status?
  2. Logs: What was printed to stderr?
  3. Input partitioning: Did some partitions have malformed data?
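
To make those three checks useful, have your script report each failure on stderr and exit non-zero if anything went wrong. The sketch below extends the stdin loop shown earlier; process_file is again a placeholder for your own logic:

python
import sys
import traceback

def process_file(path):
    """Placeholder for your per-file logic."""
    with open(path) as f:
        return len(f.read())

def main():
    failures = 0
    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        try:
            print(f"{path}\t{process_file(path)}")
        except Exception:
            failures += 1
            # stderr is the first place to look when a job fails
            print(f"ERROR processing {path}", file=sys.stderr)
            traceback.print_exc()
    # A non-zero exit code signals that this partition did not fully succeed.
    sys.exit(1 if failures else 0)

if __name__ == "__main__":
    main()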

Partial Results

If you get partial results:

  1. Check idempotency: Can you safely re-run on all inputs?
  2. Check output handling: Are you overwriting vs. appending?
  3. Check file locking: Are concurrent writes causing issues?

Performance Issues

If jobs are slow:

  1. Profile locally first: Use standard profiling tools (see the cProfile sketch below)
  2. Check I/O patterns: Are you reading the same file repeatedly?
  3. Adjust parallelism: Try -p with different values
  4. Monitor resource usage: Are you CPU-bound or I/O-bound?
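
Since each PX instance is an ordinary single-threaded program, the standard library profiler is enough for step 1. A sketch of profiling locally; main stands in for your script's entry point:

python
import cProfile
import pstats

def main():
    """Stand-in for your script's entry point."""
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()
    # Print the 10 most expensive call sites by cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)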

Best Practices

Write Debuggable Code

python
# Good: Clear, single-purpose, debuggable
def process_image(input_path, output_path):
    """Process a single image file."""
    img = load_image(input_path)
    processed = apply_filter(img)
    save_image(processed, output_path)
    return output_path

# Avoid: Complex, stateful, hard to debug
class ImageProcessor:
    def __init__(self):
        self.cache = {}
        self.counter = 0
        self.db_conn = connect_to_db()
    # ... distributed state is hard to debug

Keep It Simple

The simpler your code, the easier it is to debug at scale:

  • Prefer pure functions over stateful classes
  • Avoid global variables and shared state
  • Use explicit inputs and outputs
  • Log important operations

Test Edge Cases

Test your code with:

  • Empty inputs
  • Very large inputs
  • Malformed data
  • Missing files
  • Permission issues

If your code handles these locally, it will handle them at scale.
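
One way to lock these cases in is a small test that drives your script the way the earlier sketches assume PX does, one process with inputs on stdin, and verifies that it fails gracefully. The script name process.py and the expectation that missing files are skipped rather than fatal are assumptions carried over from the sketches above:

python
import subprocess
import sys

def run_script(stdin_text):
    """Run process.py as a single process with inputs supplied on stdin."""
    return subprocess.run(
        [sys.executable, "process.py"],
        input=stdin_text,
        capture_output=True,
        text=True,
    )

def test_empty_input():
    result = run_script("")          # no inputs at all
    assert result.returncode == 0

def test_missing_file():
    result = run_script("does_not_exist.jpg\n")
    # Policy choice in the earlier sketch: skip and report on stderr, don't crash.
    assert result.returncode == 0

if __name__ == "__main__":
    test_empty_input()
    test_missing_file()
    print("edge-case checks passed")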