Overview
Teaching: 15 min
Exercises: 0 min
Questions
How can jobs fail?
How can I figure out what went wrong?
Run test jobs and interactive jobs.
Where can jobs fail? Answer: scheduler or hardware issues, software issues, etc.
Where to find answers: output, error, and log files.
How to write a good email asking for help: describe the problem, what you expected, and what you saw instead.
Callout: a common bash scripting error caused by ^M (Windows) line endings.
Interactive jobs.
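The ^M line-ending error mentioned in the notes above typically appears when a script was edited on Windows: each line ends in a carriage return that bash treats as part of the command. The following is a minimal sketch of one way to detect and remove those characters with standard tools. The helper names has_cr and strip_cr and the demo file names are hypothetical, chosen only for illustration.

```shell
# Sketch only: has_cr and strip_cr are hypothetical helper names.

# Exit status 0 if the file contains at least one carriage return (^M).
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of file $1 to $2 with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}

# Demo: a script saved with Windows (CRLF) line endings.
workdir=$(mktemp -d)
printf '#!/bin/bash\r\necho hello\r\n' > "$workdir/run_crlf.sh"

if has_cr "$workdir/run_crlf.sh"; then
    strip_cr "$workdir/run_crlf.sh" "$workdir/run_fixed.sh"
    echo "removed carriage returns"
fi
```

Writing the cleaned copy to a new file, rather than editing in place, keeps the original available for comparison.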
The HTCondor log file, as well as the output and error files, can contain valuable information about your jobs.
The log file contains information that HTCondor tracks for each job, including when it was submitted, started, and stopped. It also records resource use and where the job ran.
000 (16173120.000.000) 03/16 09:50:48 Job submitted from host: <128.104.101.92:9618?addrs=128.104.101.92-9618
001 (16173120.000.000) 03/16 09:53:10 Job executing on host: <128.105.244.92:9618?addrs=128.105.244.92-9618&noUDP&sock=7150_4f71_3>
005 (16173120.000.000) 03/16 09:58:12 Job terminated.
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       15  1048576  11053994
       Memory (MB)          :        1   102400    102400
    (1) Normal termination (return value 0)
A return value of "0" is normal; non-zero values indicate an error.
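As a rough illustration of checking the return value without reading the log by eye, the sketch below pulls the number out of the "Normal termination" line. It assumes the log describes a single job and that the line has the same wording as the excerpt above. The function name job_return_value is hypothetical, and the sample log is a temporary file created just for the demo.

```shell
# Sketch only: job_return_value is a hypothetical helper name.

# Print the return value recorded on a "Normal termination" line.
job_return_value() {
    sed -n 's/.*Normal termination (return value \([0-9][0-9]*\)).*/\1/p' "$1"
}

# Demo log containing the termination event from the excerpt above.
log=$(mktemp)
cat > "$log" <<'EOF'
005 (16173120.000.000) 03/16 09:58:12 Job terminated.
    (1) Normal termination (return value 0)
EOF

rv=$(job_return_value "$log")
case "$rv" in
    0)  echo "job exited normally" ;;
    "") echo "no normal-termination event found" ;;
    *)  echo "job exited with return value $rv" ;;
esac
```

An empty result means the log has no normal-termination event yet, for example because the job is still running.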
Job executing on host: <128.105.244.92:
You can get the "name" of the machine where a job ran by running the host command, followed by the 4-part IP address. Using the above example, this would look like:
$ host 128.105.244.92
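To avoid copying the address by hand, the IP can be extracted from the "Job executing on host" line first. This is a minimal sketch that assumes the line format shown in the excerpt above. The function name exec_host_ip is hypothetical, and the demo writes the sample line to a temporary file.

```shell
# Sketch only: exec_host_ip is a hypothetical helper name.

# Print the IP address from each "Job executing on host" event in a log.
exec_host_ip() {
    sed -n 's/.*Job executing on host: <\([0-9.][0-9.]*\):.*/\1/p' "$1"
}

# Demo log containing the execute event from the excerpt above.
log=$(mktemp)
cat > "$log" <<'EOF'
001 (16173120.000.000) 03/16 09:53:10 Job executing on host: <128.105.244.92:9618?addrs=128.105.244.92-9618&noUDP&sock=7150_4f71_3>
EOF

ip=$(exec_host_ip "$log")
echo "$ip"    # prints 128.105.244.92

# With network access, the reverse lookup would then be:
#   host "$ip"
```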
Exercise
Run an interactive job.
Key Points
Always run a test job before submitting a full-scale job.
To test a new job, use an interactive session before submitting it.
You can use log, standard output, and standard error information to determine why jobs fail.