
Torque PBS 4.0.1: job stays queued ('Q'); scheduler gets no notifications

I am running Torque 4.0.1 on openSUSE 12.1 in a cluster environment. When I qsub a job (just "echo hello"), it stays in the 'Q' state and is never scheduled. I can force the job to run with qrun, and it then executes on the first node without errors.

I have spent the last few days looking for a solution without success. I have read the manual, the logs, and even the source code, but I still cannot pin down the cause. I have also searched extensively and tried various suggested fixes, but none of them worked.

Here is some information that may be useful:


    05/13/2012 18:55:08;0002; pbs_sched;Svr;Log;Log opened
    05/13/2012 18:55:08;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120513 opened
    05/13/2012 18:55:08;0002; pbs_sched;Svr;main;pbs_sched startup pid 32604

    05/13/2012 19:33:08;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.1, loglevel = 0
    05/13/2012 19:33:56;0100;PBS_Server;Job;16.head;enqueuing into batch, state 1 hop 1
    05/13/2012 19:33:56;0008;PBS_Server;Job;16.head;Job Queued at request of pubuser@head, owner = pubuser@head, job name = STDIN, queue = batch

    Job Id: 16.head
    Job_Name = STDIN
    Job_Owner = pubuser@head
    job_state = Q
    queue = batch
    server = head
    Checkpoint = u
    ctime = Sun May 13 19:33:56 2012
    Error_Path = head:/fserver/home/pubuser/STDIN.e16
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun May 13 19:33:56 2012
    Output_Path = head:/fserver/home/pubuser/STDIN.o16
    Priority = 0
    qtime = Sun May 13 19:33:56 2012
    Rerunable = True
    Resource_List.walltime = 01:00:00
    substate = 10
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/,
        PBS_O_WORKDIR=/fserver/home/pubuser,PBS_O_HOST=head,PBS_O_SERVER=head,
        PBS_O_WORKDIR=/fserver/home/pubuser
    euser = pubuser
    egroup = users
    queue_rank = 4
    queue_type = E
    etime = Sun May 13 19:33:56 2012
    fault_tolerant = False
    job_radix = 0
    submit_host = head
    init_work_dir = /fserver/home/pubuser

    sun1
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910403,varattr=,jobs=,state=free,netload=44492032184,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1697420kb,totmem=1802616kb,idletime=241085,nusers=0,nsessions=0,uname=Linux sun1 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun2
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910408,varattr=,jobs=,state=free,netload=39762812881,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1701012kb,totmem=1802616kb,idletime=239982,nusers=0,nsessions=0,uname=Linux sun2 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun3
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910400,varattr=,jobs=,state=free,netload=45984311925,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1699772kb,totmem=1802616kb,idletime=212303,nusers=0,nsessions=0,uname=Linux sun3 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun4
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910407,varattr=,jobs=,state=free,netload=37538584401,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805480kb,totmem=1908308kb,idletime=211197,nusers=0,nsessions=0,uname=Linux sun4 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun5
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910411,varattr=,jobs=,state=free,netload=173547166,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803816kb,totmem=1908308kb,idletime=211199,nusers=0,nsessions=0,uname=Linux sun5 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun6
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910411,varattr=,jobs=,state=free,netload=24641446,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805704kb,totmem=1908308kb,idletime=212999,nusers=0,nsessions=0,uname=Linux sun6 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun7
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910412,varattr=,jobs=,state=free,netload=1548383055,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805432kb,totmem=1908308kb,idletime=215630,nusers=0,nsessions=0,uname=Linux sun7 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun8
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910400,varattr=,jobs=,state=free,netload=128755968,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803448kb,totmem=1908308kb,idletime=211866,nusers=0,nsessions=0,uname=Linux sun8 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun9
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910374,varattr=,jobs=,state=free,netload=1371896399,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805664kb,totmem=1908308kb,idletime=211161,nusers=0,nsessions=0,uname=Linux sun9 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0
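
As a quick cross-check of the node list above (which looks like pbsnodes output), here is a minimal Python sketch that counts free slots from text in this layout. The function name `free_slots` and the trimmed sample are illustrative assumptions, not part of Torque; the sketch assumes node names start at column 0 and attributes are indented, with `state` listed before `np`.

```python
def free_slots(nodes_text):
    """Map node name -> np for every node whose state is exactly 'free'."""
    slots = {}
    node, state = None, None
    for line in nodes_text.splitlines():
        if not line.strip():
            continue  # blank line between node records
        if not line[0].isspace():
            node, state = line.strip(), None  # unindented line starts a new node
            continue
        key, sep, value = line.strip().partition(" = ")
        if not sep:
            continue  # not a "key = value" attribute line
        if key == "state":
            state = value
        elif key == "np" and state == "free":
            slots[node] = int(value)
    return slots


# Trimmed sample in the same layout as the listing above (status lines omitted).
sample = """\
sun1
     state = free
     np = 2
     ntype = cluster

sun2
     state = free
     np = 2
     ntype = cluster
"""

slots = free_slots(sample)
print(slots, sum(slots.values()))
```

Applied to all nine nodes listed above (each free with np = 2), this would report 18 free slots, so a shortage of resources does not appear to be what keeps the job queued.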

    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue batch
    #
    create queue batch

    set queue batch queue_type = Execution

    set queue batch resources_default.walltime = 01:00:00

    set queue batch enabled = True

    set queue batch started = True

    #
    # Set server attributes.
    #
    set server scheduling = True

    set server acl_hosts = head

    set server managers = pubuser@head

    set server managers += root@head

    set server operators = pubuser@head

    set server operators += root@head

    set server default_queue = batch

    set server log_events = 511

    set server mail_from = adm

    set server scheduler_iteration = 600

    set server node_check_rate = 150

    set server tcp_timeout = 300

    set server job_stat_rate = 45

    set server poll_jobs = True

    set server mom_job_sync = True

    set server keep_completed = 0

    set server submit_hosts = head

    set server next_job_number = 17

    set server moab_array_compatible = True

    Host: sun1/sun1   Version: 4.0.1   PID: 5362
    Server[0]: head (192.168.0.1:15001)
      Last Msg From Server:   1584 seconds (DeleteJob)
      Last Msg To Server:     7 seconds
    HomeDirectory:          /var/spool/torque/mom_priv
    stdout/stderr spool directory: '/var/spool/torque/spool/' (4457492 blocks available)
    MOM active:             229485 seconds
    Check Poll Time:        45 seconds
    Server Update Interval: 45 seconds
    LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
    Communication Model:    TCP
    MemLocked:              TRUE  (mlock)
    TCP Timeout:            0 seconds
    Trusted Client List:  127.0.0.1:0,192.168.0.1:0,192.168.0.101:0,192.168.0.101:15003,192.168.0.102:15003,192.168.0.103:15003,192.168.0.104:15003,192.168.0.105:15003,192.168.0.106:15003,192.168.0.107:15003,192.168.0.108:15003,192.168.0.109:15003:  0
    Copy Command:           /usr/bin/scp -rpB
    NOTE:  no local jobs detected

    diagnostics complete

The problem is that TCP Timeout is reported as 0 seconds, which does not seem normal. While the diagnostics ran, the following entry appeared in mom_logs:


    05/13/2012 20:30:10;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Resource temporarily unavailable (11) in tcp_read_proto_version, no protocol version number End of File (errno 2)
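
To make entries like this easier to filter, here is a minimal Python sketch that splits a log line into its semicolon-separated fields. The field names and the reading of the second field as a hexadecimal event code are my assumptions based on the lines shown in this post; `LogRecord` and `parse_log_line` are hypothetical helpers, not part of Torque.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: str
    event: int      # assumed to be a hexadecimal event code
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_log_line(line: str) -> LogRecord:
    # Split on the first five semicolons only, so the message may contain ';'.
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected log line: {line!r}")
    ts, event, daemon, obj_class, obj_name, message = parts
    return LogRecord(ts, int(event, 16), daemon.strip(), obj_class, obj_name, message)


line = ("05/13/2012 20:30:10;0001;   pbs_mom;Svr;pbs_mom;"
        "LOG_ERROR::Resource temporarily unavailable (11) in "
        "tcp_read_proto_version, no protocol version number End of File (errno 2)")
rec = parse_log_line(line)
print(rec.daemon, hex(rec.event), rec.message)
```

With records in this shape, the scheduler, server, and MOM logs can be filtered by daemon or event code and lined up by timestamp around a single submission.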

I googled it but found nothing.

I hope someone can help solve this problem. Thanks!