Computing at Scale

Troubleshooting

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • How can jobs fail?

  • How can I figure out what went wrong?

Run test jobs or interactive jobs to catch problems early.

Where can jobs fail? Common causes include scheduler or hardware issues and software issues.

Where to find answers: the output, error, and log files.

How to write a good email to ask for help: describe the problem, what you expected, and what you saw instead.

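Callout: a common bash scripting error comes from ^M line endings. ^M is the carriage return character that Windows editors add at the end of each line, and bash treats it as part of the command.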
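A script with ^M line endings often fails with a message such as "bad interpreter", because the carriage return becomes part of the interpreter name on the first line. The following is a minimal sketch of how such line endings could be detected and converted; the function names and the sample script contents are hypothetical and chosen only for illustration.

```python
def has_crlf(data: bytes) -> bool:
    """Return True if the data contains Windows-style CRLF line endings."""
    return b"\r\n" in data


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving all other bytes unchanged."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical script contents, as a Windows editor might save them.
script = b"#!/bin/bash\r\necho hello\r\n"

fixed = to_unix_line_endings(script)
print(has_crlf(script), has_crlf(fixed))
```

To apply the same idea to a real file, read it in binary mode (open(path, "rb")) so that Python does not translate the line endings before the check.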

On our system

interactive jobs

The HTCondor log file, as well as the output and error files, can contain valuable information about your jobs.

The log file contains information that HTCondor tracks for each job, including when it was submitted, started, and stopped. It also describes resource use and where the job ran.
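As a rough illustration of reading those times out of a log, the sketch below extracts the submit, start, and stop events from a log excerpt and computes how long the job waited and ran. The excerpt is made up for this example, not real output. The sketch assumes each event line begins with a three-digit event code (000 for submission, 001 for execution, 005 for termination), followed by the job ID and a timestamp in YYYY-MM-DD HH:MM:SS form. The timestamp format varies between HTCondor versions, so the pattern may need adjusting for the log files on your system.

```python
import re
from datetime import datetime

# Illustrative log excerpt (made up for this sketch, not real output).
SAMPLE_LOG = """\
000 (1234.000.000) 2024-01-15 10:00:00 Job submitted from host: <submit-host>
...
001 (1234.000.000) 2024-01-15 10:02:30 Job executing on host: <execute-host>
...
005 (1234.000.000) 2024-01-15 10:12:45 Job terminated.
\t(1) Normal termination (return value 0)
...
"""

# Event codes this sketch looks for (assumed meaning of each code).
EVENT_NAMES = {"000": "submitted", "001": "started", "005": "stopped"}

# Assumed layout of an event line: code, (cluster.proc.subproc), timestamp.
EVENT_LINE = re.compile(
    r"^(\d{3}) \(\d+\.\d+\.\d+\) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
)


def job_timeline(log_text):
    """Map each recognized event name to the time it was logged."""
    timeline = {}
    for line in log_text.splitlines():
        match = EVENT_LINE.match(line)
        if match is None:
            continue  # detail lines and "..." separators are skipped
        code, stamp = match.groups()
        name = EVENT_NAMES.get(code)
        if name is not None:
            timeline[name] = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return timeline


timeline = job_timeline(SAMPLE_LOG)
wait = timeline["started"] - timeline["submitted"]
run = timeline["stopped"] - timeline["started"]
print(f"waited {wait}, ran {run}")
```

A job whose log shows a submission event but no execution event never started, which usually points to a scheduling or resource-matching problem rather than a bug in the job's own software.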

Exercise

Run an interactive job.

Key Points