Channel: Clusters and HPC Technology

integration problem between Torque 4 and Intel(R) MPI Library for Linux* OS, Version 2019 Update 1


Hi!

I have successfully compiled and linked a program with Intel MPI, and if I run it interactively or in the background it runs very fast and without any problems on our new server (ProLiant DL580 Gen10, 1 node with 4 processors of 18 cores each, 72 cores total, hyperthreading disabled). If I submit it through Torque (version 4), strange things happen, for example:

1) if I submit 2 jobs, each asking for 8 cores, they are both fine

2) if I submit a third job (8 cores), it is 4 times slower because its 8 processes run on only two cores!

3) if I submit a fourth job it runs properly, but if I qdel all four jobs, they all disappear from qstat -a yet the fourth keeps running!

From previous discussions I noticed in this forum, I have the feeling it is an integration problem between Intel MPI and Torque, so I did the following:

 export I_MPI_PIN=off
 export I_MPI_PIN_DOMAIN=socket

To run the program, I invoked mpirun as follows:

/opt/intel/compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin/mpirun -d -rmk pbs -bootstrap pbsdsh .................

I have checked and PBS_ENVIRONMENT is properly set to PBS_BATCH

Also, the Torque configuration is apparently correct: the file /var/lib/torque/server_priv/nodes contains the following line:

dscfbeta1.units.it np=72 num_node_boards=1
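
For reference, a minimal Torque submission script along these lines (the job name, core count, and program name are placeholders, not taken from this post) would be:

#!/bin/bash
#PBS -N impi_test
#PBS -l nodes=1:ppn=8
cd $PBS_O_WORKDIR
# load the Intel MPI environment
source /opt/intel/compilers_and_libraries_2019.1.144/linux/mpi/intel64/bin/mpivars.sh
# start one rank per core granted by Torque
NP=$(wc -l < $PBS_NODEFILE)
mpirun -rmk pbs -bootstrap pbsdsh -np $NP ./my_program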

This is a severe problem for me: the machine is shared, so we do need a scheduler like Torque (PBS) to run jobs compiled and linked against Intel MPI. Any help or suggestion is welcome!

Thank you in advance,

Mauro


Conda impi_rt=2019.1 doesn't substitute I_MPI_ROOT in bin/mpivars.sh


I'm not sure where to report this bug, but it forces me to stick with intelpython 2018.0.3. The steps to reproduce are:

conda config --add channels intel
conda create -n test impi_rt=2019.1

You will find that /path/to/envs/test/bin/mpivars.sh does not have I_MPI_ROOT substituted correctly.
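
As a stopgap (my assumption, not a verified fix), one can export I_MPI_ROOT by hand so the runtime is found even though mpivars.sh still carries the unsubstituted placeholder:

# point I_MPI_ROOT at the conda environment manually
export I_MPI_ROOT=/path/to/envs/test
export PATH="$I_MPI_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$I_MPI_ROOT/lib:$LD_LIBRARY_PATH"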

Or is conda no longer the supported way to install the Intel Performance Libraries? If so, what's the most future-proof way? And if it is still the best way, where should I report this bug? Thanks.

integer overflow for MPI_COMM_WORLD ref-counting in MPI_Iprobe


Calling MPI_Iprobe 2^31 times results in the following error:

Abort(201962501) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Iprobe: Invalid communicator, error stack:
PMPI_Iprobe(123): MPI_Iprobe(src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, MPI_COMM_WORLD, flag=0x7ffd925056c0, status=0x7ffd92505694) failed
PMPI_Iprobe(90).: Invalid communicator

On our system, it takes about 10 minutes to perform this number of calls in a loop.

The affected version is Intel MPI 2019.1.144 (based on MPICH 3.3).
 

The expected behavior is that MPI_Iprobe is neutral with respect to the reference count of the provided communicator; for MPI_COMM_WORLD in particular, reference counting is superfluous.
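
A minimal reproducer along these lines (my sketch, not the original test code) shows the problem after roughly 2^31 iterations:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int flag;
    MPI_Status status;
    /* Probe MPI_COMM_WORLD just past 2^31 times; each call should leave the
       communicator's reference count unchanged. */
    for (long long i = 0; i < (1LL << 31) + 16; ++i)
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
    printf("done\n");
    MPI_Finalize();
    return 0;
}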

Bad Termination Error Exit Code 4


Hi,

I have a binary which was compiled on Haswell using Intel 16.0 and IMPI 5.1.1. It runs fine on Haswell, but when I try to run it on Skylake nodes, the binary crashes right away with this error:

==================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 99283 RUNNING AT iforge127

=   EXIT CODE: 4

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

I understand the issue may be with the application, but I would like to know how to debug and resolve it. Thank you for the help.
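
A first debugging pass I would try (the binary name and rank count below are placeholders, not taken from the post):

export I_MPI_DEBUG=5        # print Intel MPI startup and pinning details
ulimit -c unlimited         # allow the failing rank to write a core file
mpirun -np 4 ./my_binary
gdb ./my_binary core.*      # 'bt' in gdb shows where the process died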

Regards,

Intel MPI with Distributed Ansys Mechanical


Can anyone share a success story of running distributed Ansys (Mechanical) with Intel MPI on Windows 10 between two PCs?

Long story short: I can launch a distributed analysis on a single PC with Intel MPI, but I cannot launch a distributed analysis between two PCs, whereas IBM MPI can.

Here is what I have done so far (and I hope to get some guidance from you):

Hardware: two Dell workstations, identical CPU, RAM, and everything else.

OS: Windows 10

Intel MPI Library: 2017 Update 3

After installing the Intel MPI Library, setting up the environment variables, and caching the password on each machine, I ran the test "mpiexec -n 4 -ppn 2 -machine machines.txt test" and got the following output, which indicates that Intel MPI communicates successfully between the two PCs:

Hello world: rank 0 of 4 running on node1
Hello world: rank 1 of 4 running on node2
Hello world: rank 2 of 4 running on node1
Hello world: rank 3 of 4 running on node2
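
For reference, the machine file given to Intel MPI's -machine option is, as far as I know, one host name per line with an optional process count; the names below are placeholders:

node1:2
node2:2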

I did the same test on each PC with the command "ansys192 -np 2 -mpitest", and both PCs show "MPI Test has completed successfully!"

However, when I run the distributed test "ansys192 -machine machines.txt -mpitest", it looks like Ansys still treats it as a single-PC test, as shown in the output below:

Mechanical APDL execution Command: mpiexec -np 2 -genvlist ANS_USER_PATH,ANSWAIT,ANSYS_SYSDIR,ANSYS_SYSDIR32,ANSYS192_DIR,ANSYSLI_RESERVE_ID,ANSYSLI_USAGE,AWP_LOCALE192,AWP_ROOT192,CADOE_DOCDIR192,CADOE_LIBDIR192,LSTC_LICENSE,P_SCHEMA,PATH,I_MPI_COLL_INTRANODE,I_MPI_AUTH_METHOD  -localroot "C:\Program Files\ANSYS Inc\v192\ANSYS\bin\winx64\MPITESTINTELMPI.EXE"  -machine machines.txt -mpitest

I appreciate all your feedback. Thank you!

How should I edit machines.LINUX file for my cluster?


Hello everybody:

I am a new cluster user. I recently updated to Intel Composer XE 2013 to compile Fortran,

and I found that Readme.txt says I need a machines.LINUX file to make sure every node can be used to run my Fortran programs.

How should I edit the machines.LINUX file correctly? I have found some examples, e.g.

BASH: cluster_prereq_is_remote_dir_mounted(): compute-11-37 <- /opt/intel -> compute-12-26
BASH: cluster_prereq_is_remote_dir_mounted(): compute-11-37 <- /opt/intel -> compute-12-27
BASH: cluster_prereq_is_remote_dir_mounted(): compute-11-37 <- /opt/intel -> compute-12-28
...

or

clusternode01

clusternode02

clusternode03

...

 

Which format is correct? I am very confused about this. Please help me, thanks so much!
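
As far as I know, the second form is the usual one: machines.LINUX is a plain list of node host names, one per line (the names below are placeholders):

clusternode01
clusternode02
clusternode03
clusternode04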

MPI Crashing


Hello,

I recently upgraded my OS to Ubuntu 18.04 and have had problems since.

I have now reformatted my desktop, installed a fresh copy of Ubuntu 18.04, and installed the Intel C++ compiler and MPI Library 2019 Update 2.

When I run my code, after a couple of hours and thousands of time steps, I get the following error message:

 

Abort(873060101) on node 15 (rank 15 in comm 0): Fatal error in PMPI_Recv: Invalid communicator, error stack:
PMPI_Recv(171): MPI_Recv(buf=0x4b46a00, count=36912, MPI_DOUBLE, src=14, tag=25, MPI_COMM_WORLD, status=0x1) failed
PMPI_Recv(103): Invalid communicator
[cli_15]: readline failed

 

My code used to run fine on Ubuntu 16.04 (with an older version of Intel's compiler and MPI), and it also runs well on various large clusters.

My code uses MPI_Isend for sending information and MPI_Recv for receiving. Throughout my code I only use the MPI_COMM_WORLD communicator and never create a new one.
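
For what it's worth, a minimal sketch of that Isend/Recv pattern (not the poster's code) is shown below; one common source of exactly this kind of delayed corruption is an MPI_Isend whose request is never completed with MPI_Wait before the send buffer is reused:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 36912;            /* message size taken from the error output above */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    for (int i = 0; i < count; ++i) sendbuf[i] = (double)rank;

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    MPI_Request req;
    MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 25, MPI_COMM_WORLD, &req);
    MPI_Recv(recvbuf, count, MPI_DOUBLE, left, 25, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete the Isend before reusing or freeing sendbuf */

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}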

Can you please help me find out what's wrong?

 

Thank you,

 

Elad

 

intel mpi crash at many ranks


Hi,

We're testing Intel MPI (Intel 19, patch 1) on CentOS 7.5; it is a Linux cluster with an InfiniBand network.

Testing the Intel MPI Benchmarks, we found that it works well at small scale (400 MPI ranks on 10 nodes), but at larger scales like 100 nodes (100*40 = 4000 MPI ranks) it crashes with the message shown at the bottom. I recompiled libfabric, but it doesn't improve the situation, and I_MPI_DEBUG=5 doesn't give us the details either. Is there any way to track down the cause of the crash? fi_info results are shown below for reference. Any comments are appreciated.
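
For reference, a typical next step would be something like the following (general Intel MPI 2019 / libfabric debugging knobs, not a known fix for this crash):

export I_MPI_DEBUG=6        # higher values print more startup detail than 5
export FI_LOG_LEVEL=debug   # libfabric-level logging
export FI_PROVIDER=verbs    # pin the OFI provider explicitly
mpirun -np 4000 -machinefile hosts ./IMB-EXT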

Thanks,

BJ

PS1.

$ fi_info 
provider: verbs;ofi_rxm
    fabric: IB-0xfe80000000000000
    domain: mlx5_0
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM
provider: verbs
    fabric: IB-0xfe80000000000000
    domain: mlx5_0
    version: 1.0
    type: FI_EP_MSG
    protocol: FI_PROTO_RDMA_CM_IB_RC
provider: verbs
    fabric: IB-0xfe80000000000000
    domain: mlx5_0-dgram
    version: 1.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_IB_UD
 

PS2. Crash message (mpirun -np 4000 -genv I_MPI_DEBUG 5 -machinefile hosts ./IMB-EXT):

[proxy:0:

# Bidir_Get
# Bidir_Put
# Accumulate
Abort(743005711) on node 3856 (rank 3856 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3856, new_comm=0x27f9f44) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3856]: readline failed
Abort(407461391) on node 3872 (rank 3872 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3872, new_comm=0xc81e9e4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(407461391) on node 2782 (rank 2782 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2782, new_comm=0x1e978b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(1011441167) on node 3906 (rank 3906 in comm 0): Fatal error in PMPI_Comm_s
plit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3906, new_comm=0xb944f14) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3906]: readline failed
Abort(810114575) on node 3907 (rank 3907 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3907, new_comm=0xc1eff14) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(608787983) on node 3306 (rank 3306 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3306, new_comm=0x2014034) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3306]: readline failed
Abort(541679119) on node 2542 (rank 2542 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2542, new_comm=0x2aeb954) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2542]: readline failed
Abort(743005711) on node 3380 (rank 3380 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3380, new_comm=0x1879a04) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3380]: readline failed
Abort(273243663) on node 3782 (rank 3782 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3782, new_comm=0x257b8e4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(4808207) on node 1072 (rank 1072 in comm 0): Fatal error in PMPI_Comm_spli
t: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=1072, new_comm=0x1e9d794) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_1072]: readline failed
[cli_3782]: readline failed
Abort(273243663) on node 1664 (rank 1664 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=1664, new_comm=0xb14a534) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Connection timed out)
[cli_1664]: readline failed
Abort(71917071) on node 2942 (rank 2942 in comm 0): Fatal error in PMPI_Comm_spl
it: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2942, new_comm=0x28c68b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2942]: readline failed
Abort(474570255) on node 2958 (rank 2958 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2958, new_comm=0x2527ff4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(474570255) on node 3552 (rank 3552 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3552, new_comm=0xc3fa3e4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3552]: readline failed
Abort(139025935) on node 3630 (rank 3630 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3630, new_comm=0x1859ff4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3630]: readline failed
Abort(541679119) on node 3634 (rank 3634 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3634, new_comm=0x1d2d8b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(474570255) on node 2822 (rank 2822 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2822, new_comm=0x32a68b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2822]: readline failed
Abort(474570255) on node 2704 (rank 2704 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2704, new_comm=0xbf11584) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2704]: readline failed
Abort(810114575) on node 2100 (rank 2100 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=2100, new_comm=0x141f8b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_2100]: readline failed
Abort(1011441167) on node 3348 (rank 3348 in comm 0): Fatal error in PMPI_Comm_s
plit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3348, new_comm=0xb111504) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3348]: readline failed
Abort(4808207) on node 3446 (rank 3446 in comm 0): Fatal error in PMPI_Comm_spli
t: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3446, new_comm=0x2c95724) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3446]: readline failed
Abort(608787983) on node 3450 (rank 3450 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3450, new_comm=0x2c84724) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(340352527) on node 3824 (rank 3824 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3824, new_comm=0x1b4a8a4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3824]: readline failed
Abort(474570255) on node 3937 (rank 3937 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3937, new_comm=0xb9e7eb4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3937]: readline failed
Abort(340352527) on node 3979 (rank 3979 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3979, new_comm=0xbea1d94) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3979]: readline failed
Abort(810114575) on node 3826 (rank 3826 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3826, new_comm=0x32af8b4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(1011441167) on node 3982 (rank 3982 in comm 0): Fatal error in PMPI_Comm_s
plit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3982, new_comm=0xd005f74) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(407461391) on node 3975 (rank 3975 in comm 0): Fatal error in PMPI_Comm_sp
lit: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3975, new_comm=0xc95ddb4) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
Abort(4808207) on node 3572 (rank 3572 in comm 0): Fatal error in PMPI_Comm_spli
t: Other MPI error, error stack:
PMPI_Comm_split(507)...................: MPI_Comm_split(MPI_COMM_WORLD, color=-3
2766, key=3572, new_comm=0x1552874) failed
PMPI_Comm_split(489)...................: 
MPIR_Comm_split_impl(167)..............: 
MPIDI_SHMGR_Gather_generic(1195).......: 
MPIDI_NM_mpi_allgather(352)............: 
MPIR_Allgather_intra_knomial(216)......: 
MPIC_Isend(525)........................: 
MPID_Isend(345)........................: 
MPIDI_OFI_send_lightweight_request(110): 
MPIDI_OFI_send_handler(726)............: OFI tagged inject failed (ofi_impl.h:72
6:MPIDI_OFI_send_handler:Invalid argument)
[cli_3572]: readline failed
[proxy:0:82@atom84] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:82@atom84] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:82@atom84] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:82@atom84] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:82@atom84] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): err
or waiting for event
[proxy:0:88@atom90] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:88@atom90] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:88@atom90] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:88@atom90] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:88@atom90] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): err
or waiting for event
[proxy:0:70@atom72] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:70@atom72] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:70@atom72] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:70@atom72] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:70@atom72] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): err
or waiting for event
[proxy:0:21@atom23] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:21@atom23] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:21@atom23] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:21@atom23] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:21@atom23] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:40@atom42] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:40@atom42] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:40@atom42] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:40@atom42] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:40@atom42] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:64@atom66] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:64@atom66] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:64@atom66] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:64@atom66] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:64@atom66] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:58@atom60] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:58@atom60] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:58@atom60] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:58@atom60] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:58@atom60] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:34@atom36] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:34@atom36] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:34@atom36] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:34@atom36] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:34@atom36] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:46@atom48] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:46@atom48] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:46@atom48] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:46@atom48] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:46@atom48] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
[proxy:0:28@atom30] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/
hydra_sock_intel.c:353): [proxy:0:14@atom16] HYD_sock_write (../../../../../src/
pm/i_hydra/libhydra/sock/hydra_sock_intel.c:353): write error (Bad file descript
or)
[proxy:0:14@atom16] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:14@atom16] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:14@atom16] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:14@atom16] main (../../../../../src/pm/i_hydra/proxy/proxy.c:989): erro
r waiting for event
write error (Bad file descriptor)
[proxy:0:28@atom30] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/prox
y_cb.c:33): error reading command
[proxy:0:28@atom30] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/pro
xy/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:28@atom30] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/
libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:28@atom30] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): err
or waiting for event
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[mpiexec@atom2] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydr
a_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:7@atom9] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hy
dra_sock_intel.c:353): write error (Bad file descriptor)
[proxy:0:7@atom9] cmd_bcast_non_root (../../../../../src/pm/i_hydra/proxy/proxy_
cb.c:33): error reading command
[proxy:0:7@atom9] proxy_upstream_control_cb (../../../../../src/pm/i_hydra/proxy
/proxy_cb.c:155): error forwarding cmd downstream
[proxy:0:7@atom9] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/li
bhydra/demux/hydra_demux_poll.c:79): callback returned error status
[proxy:0:7@atom9] main (../../../../../src/pm/i_hydra/proxy/proxy.c:1035): error
 waiting for event


intel mpi at 4000 ranks


Hi, we're testing Intel MPI on CentOS 7.5 with InfiniBand interconnects.

Using the Intel MPI Benchmarks, small-scale tests (10 nodes, 400 MPI ranks) look OK, while a 100-node (4000-rank) job crashes. FI_LOG_LEVEL=debug yielded the following messages:

libfabric:verbs:fabric:fi_ibv_create_ep():173<info> rdma_create_ep: Invalid argument(22)
libfabric:ofi_rxm:ep_ctrl:rxm_eq_sread():575<warn> fi_eq_readerr: err: 111, prov_err: Unknown error -28 (-28)
libfabric:verbs:fabric:fi_ibv_set_default_attr():1085<info> Ignoring provider default value for tx rma_iov_limit as it is greater than the value supported by domain: mlx5_0

Is there any way to trace the cause of the issue? Any comments are appreciated.

Thanks,

BJ

Where can I download MPI runtime redistributable as a separate package


Hi, I am having difficulty locating the runtime redistributables package (.tz/.tar.gz) on your website. Can anyone point me to the download location?

How to use MCDRAM in Hybrid Mode on Theta


Hi,

I have a question about how to use MCDRAM in hybrid mode. Let me call the portion of MCDRAM used as a cache the "cache path", and the portion used as addressable memory the "HBM path". Can I allocate data only on the cache path, or only on the HBM path, by using numactl -m as in flat mode? I assume that by default, when using MCDRAM in hybrid mode, data is allocated only on the cache path, and that if we add numactl -m, data can be allocated in the HBM path only. I don't know whether my guess is right. Any suggestions or commands are welcome.
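
A sketch of the flat-mode-style approach, under the assumption that in hybrid mode the addressable part of MCDRAM shows up as its own NUMA node (the node id 1 below is an assumption; check numactl -H on the actual machine):

numactl -H                  # list NUMA nodes; the small-capacity one is the addressable MCDRAM
numactl -m 1 ./a.out        # bind all allocations to that node
numactl -p 1 ./a.out        # or prefer it, falling back to DDR when it is full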

I appreciate all your feedback. Thank you!

HPCC benchmark "Begin of MPIRandomAccess section" hangs


Hello,

I am trying to run the HPCC-1.5.0 benchmark on the cluster using the Intel 2019 compilers and MPI. I was able to successfully compile the hpcc code and run it on up to 4 cores on the head node, but if I increase the number of cores, the benchmark seems to hang at "Begin of MPIRandomAccess section" (the very first benchmark test). I can run the same code successfully using the Intel 2013 compilers and MPI. Has anyone else seen anything similar, or does anyone have pointers to what could be happening and how to fix it? Any help is greatly appreciated!
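
For reference, one hedged triage step (the provider name below is an assumption about the libfabric build shipped with Intel MPI 2019, and the binary path is a placeholder; this is not a known fix):

export I_MPI_DEBUG=5                        # print startup, pinning, and fabric details
FI_PROVIDER=sockets mpirun -np 8 ./hpcc     # force the plain sockets provider to rule out the fast-fabric path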

Thank you

Krishna

Intel MPI - Unable to run on Microsoft Server 2016


We are trying to run in parallel on a single node using Intel MPI 2018.0.124 and are getting the following error:

..\hydra\pm\pmiserv\pmiserv_cb.c (834): connection to proxy 0 at host XXX-NNNN failed
..\hydra\tools\demux\demux_select.c (103): callback returned error status
..\hydra\pm\pmiserv\pmiserv_pmci.c (507): error waiting for event
..\hydra\ui\mpich\mpiexec.c (1148): process manager error waiting for completion

We have checked the hydra service status and found it to be running.

mpiexec also seems to be working OK:

mpiexec -n 2 hostname - returns the localhost name

mpiexec -validate - returns success

We have also checked that the hydra service is running the version we want and that it is the only version on the machine.

Is there anything we can do to check why the runs fail?
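
A few hedged checks that sometimes narrow this down (hostname is only a stand-in test program; none of this is a confirmed fix): re-register the credentials, bypass the hydra service as a cross-check, and rerun with verbose output:

mpiexec -register
mpiexec -localonly -n 2 hostname
mpiexec -genv I_MPI_DEBUG 5 -n 2 hostname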

Thanks!

 

IMB Alltoall hang with Intel Parallel Studio 2018.0.3


Hi,

When running IMB Alltoall at 32 ranks/node on 100 nodes, the job stalls before printing the 0-byte results. Processes appear to be in sched_yield() when traced. With 2, 4, 8, or 16 ranks/node, the job runs fine.

The cluster is dual-socket Skylake with 18 cores/socket, running CentOS 7.4; ibv_devinfo output is shown below. We've been having reproducible trouble with Intel MPI at high rank counts on our system, but are still troubleshooting whether it's a fabric issue or an MPI issue.

The job is launched with:

srun -n 3200 --cpu-bind=verbose --ntasks-per-socket=16 src/IMB-MPI1 -npmin 3200 Alltoall
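
One hedged triage step (I_MPI_ADJUST_ALLTOALL selects the collective algorithm; the value 1 below is an arbitrary choice for comparison, not a recommendation):

export I_MPI_DEBUG=5
export I_MPI_ADJUST_ALLTOALL=1
srun -n 3200 --cpu-bind=verbose --ntasks-per-socket=16 src/IMB-MPI1 -npmin 3200 Alltoall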

 

Thanks; Chris

 

[cchang@r4i2n26 ~]$ ibv_devinfo -v hca_id: mlx5_0 transport: InfiniBand (0) fw_ver: 12.21.1000 node_guid: 506b:4b03:002b:e41e sys_image_guid: 506b:4b03:002b:e41e vendor_id: 0x02c9 vendor_part_id: 4115 hw_ver: 0x0 board_id: SGI_P0001721_X phys_port_cnt: 1 max_mr_size: 0xffffffffffffffff page_size_cap: 0xfffffffffffff000 max_qp: 262144 max_qp_wr: 32768 device_cap_flags: 0xe17e1c36 BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN XRC Unknown flags: 0xe16e0000 device_cap_exp_flags: 0x5048F8F100000000 EXP_DC_TRANSPORT EXP_CROSS_CHANNEL EXP_MR_ALLOCATE EXT_ATOMICS EXT_SEND NOP EXP_UMR EXP_ODP EXP_RX_CSUM_TCP_UDP_PKT EXP_RX_CSUM_IP_PKT EXP_DC_INFO EXP_MASKED_ATOMICS EXP_RX_TCP_UDP_PKT_TYPE EXP_PHYSICAL_RANGE_MR Unknown flags: 0x200000000000 max_sge: 30 max_sge_rd: 30 max_cq: 16777216 max_cqe: 4194303 max_mr: 16777216 max_pd: 16777216 max_qp_rd_atom: 16 max_ee_rd_atom: 0 max_res_rd_atom: 4194304 max_qp_init_rd_atom: 16 max_ee_init_rd_atom: 0 atomic_cap: ATOMIC_HCA (1) log atomic arg sizes (mask) 0x8 masked_log_atomic_arg_sizes (mask) 0x3c masked_log_atomic_arg_sizes_network_endianness (mask) 0x34 max fetch and add bit boundary 64 log max atomic inline 5 max_ee: 0 max_rdd: 0 max_mw: 16777216 max_raw_ipv6_qp: 0 max_raw_ethy_qp: 0 max_mcast_grp: 2097152 max_mcast_qp_attach: 240 max_total_mcast_qp_attach: 503316480 max_ah: 2147483647 max_fmr: 0 max_srq: 8388608 max_srq_wr: 32767 max_srq_sge: 31 max_pkeys: 128 local_ca_ack_delay: 16 hca_core_clock: 156250 max_klm_list_size: 65536 max_send_wqe_inline_klms: 20 max_umr_recursion_depth: 4 max_umr_stride_dimension: 1 general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT max_size: 0xFFFFFFFFFFFFFFFF rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ uc_odp_caps: NO SUPPORT ud_odp_caps: SUPPORT_SEND dc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ xrc_odp_caps: NO SUPPORT raw_eth_odp_caps: NO SUPPORT max_dct: 262144 max_device_ctx: 1020 Multi-Packet RQ is not supported rx_pad_end_addr_align: 64 tso_caps: max_tso: 0 packet_pacing_caps: qp_rate_limit_min: 0kbps qp_rate_limit_max: 0kbps ooo_caps: ooo_rc_caps = 0x0 ooo_xrc_caps = 0x0 ooo_dc_caps = 0x0 ooo_ud_caps = 0x0 sw_parsing_caps: supported_qp: tag matching not supported tunnel_offloads_caps: Device ports: port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 2000 port_lmc: 0x00 link_layer: InfiniBand max_msg_sz: 0x40000000 port_cap_flags: 0x2651e848 max_vl_num: 4 (3) bad_pkey_cntr: 0x0 qkey_viol_cntr: 0x0 sm_sl: 0 pkey_tbl_len: 128 gid_tbl_len: 8 subnet_timeout: 18 init_type_reply: 0 active_width: 4X (2) active_speed: 25.0 Gbps (32) phys_state: LINK_UP (5) GID[ 0]: fec0:0000:0000:0000:506b:4b03:002b:e41e

General question of Intel Trace Analyzer and Collector


Hi:

I'm new here and need to ask some questions about this tool: Intel Trace Analyzer and Collector.

Is this software Intel Xeon exclusive, or can it run on multiple platforms?

Also, if our company wants to purchase this tool, where should I ask?

 

Many thanks

Chi


"dapl fabric is not available and fallback fabric is not enabled"


Hi Support team,

I am trying to use Intel MPI with the DAPL fabric to run a molding simulation package on an InfiniBand/RDMA fabric.

But I get the error "dapl fabric is not available and fallback fabric is not enabled".

 

Detailed info:

Cluster nodes:
CPU: Intel Xeon E5-1620
RAM: 32 GB
NIC: Mellanox ConnectX-5 VPI adapter
Driver: WinOF-2 v2.10.50010
OS: Windows Server 2016 Standard

Test Case 1:
Using Microsoft MPI (Microsoft HPC Pack 2016 Update 2 + fixes), the result: it works well on the InfiniBand fabric.

Test Case 2:
Replacing MS-MPI with Intel MPI (IMPI) 2018, the result: I get the error "dapl fabric is not available and fallback fabric is not enabled" when executing the Intel MPI command. My command is as below:

Command: c:\Users\Administrator>"C:\Program Files\Intel MPI 2018\x64\impiexec.exe -genv I_MPI_DEBUG 5 -DAPL -host2 192.168.191.21 192.168.181.22 1 \\IBCN3\Moldex3D_R17]Bin\IMB-MPI1.exe
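
For what it's worth, with Intel MPI 2018 the fabric and fallback behavior are normally controlled through the I_MPI_FABRICS and I_MPI_FALLBACK environment variables (set beforehand or passed with -genv); a sketch, not a verified fix for this setup:

set I_MPI_FABRICS=shm:dapl
set I_MPI_FALLBACK=enable
set I_MPI_DEBUG=5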

Please advise,

 

BRs,

Jeffrey

Job terminates abnormally on HPC using intel mpi


Hello all,

I recently installed the program CP2K on our HPC system using "Intel(R) Parallel Studio XE 2017 Update 4 for Linux". After a successful installation, when I run the executable with mpirun -machinefile $PBS_NODEFILE -n 40 ./cp2k $var >& out, I get the following error message at the end of the output file and my job terminates:

rank = 33, revents = 25, state = 8
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 7

I'm using the following job script:

#!/bin/bash
#PBS -N test
#PBS -q mini
#PBS -l nodes=2:ppn=20
cd $PBS_O_WORKDIR
export I_MPI_FABRICS=shm:tcp
export I_MPI_MPD_TMPDIR=/scratch/$USER
EXEC=~/cp2k-6.1/exe/Linux-x86-64-intelx/cp2k.popt
cp $EXEC cp2k
mpirun -machinefile $PBS_NODEFILE -n 40 ./cp2k $var >& out
rm cp2k

I shall be grateful for the help.

Thank you,
Raghav

Running executable compiled with MPI without using mpirun


Hello,

I'm having trouble running the abinit (8.10.2) executable (an electronic structure program; www.abinit.org) after compiling with the Intel 19 Update 3 compilers and with MPI enabled (64-bit Intel Linux).

If I compile with either the gnu tools or the Intel tools (icc, ifort), and without MPI enabled, I can directly run the abinit executable with no errors.

If I compile with the GNU tools and MPI enabled (using Open MPI), I can still run the abinit executable directly (without using mpirun) without errors.

If I compile with the Intel tools (mpiicc, mpiifort) and MPI enabled (using Intel MPI) and then try to run the abinit executable directly (without mpirun), it fails with the following error when trying to read the input file (t01.input):

abinit < t01.input > OUT-traceback
forrtl: severe (24): end-of-file during read, unit 5, file /proc/26824/fd/0
Image                 PC                              Routine                 Line           Source             
libifcoremt.so.5   00007F0847FAC7B6  for__io_return        Unknown  Unknown
libifcoremt.so.5   00007F0847FEAC00  for_read_seq_fmt   Unknown  Unknown
abinit                  000000000187BC1F  m_dtfil_mp_iofn1_   1363        m_dtfil.F90
abinit                  0000000000407C49  MAIN__                    251          abinit.F90
abinit                  0000000000407942  Unknown                 Unknown  Unknown
libc-2.27.so         00007F08459E4B97  __libc_start_main     Unknown  Unknown
abinit                  000000000040782A  Unknown                 Unknown  Unknown

If I compile with the Intel tools and MPI enabled and run the abinit executable with "mpirun -np 1 abinit < t01.input > OUT-traceback" then reading the input file succeeds.

Running the MPI-enabled executable without mpirun succeeds when compiled with the GNU tools, but not when compiled with the Intel tools.

A colleague of mine compiled abinit with MPI enabled using the Intel 17 compiler and IS able to run the abinit executable without mpirun.

I am using Intel Parallel Studio XE Cluster Edition Update 3, and I source psxevars.sh to set the environment before compiling/running with Intel. The output of mpiifort -V is:

mpiifort -V
Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.3.199 Build 20190206
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

Any ideas on what is causing this forrtl crash?

Thanks for any suggestions.

I_MPI_WAIT_MODE replacement in Intel MPI?

MPI Processes - socket mapping and threads per process - core mapping


Hi,
I have a node with 2 sockets and 20 cores per socket (Intel(R) Xeon(R) Gold 6148 CPU).
I wish to launch 1 process per socket and 20 threads per process; if possible, all threads should be pinned to their respective cores.

Earlier, I used to run Intel binaries on a Cray machine with a similar core layout, and the syntax was:
aprun -n (mpi tasks) -N (tasks per node) -S (tasks per socket) -d (thread depth) <executable>, for example:

OMP_NUM_THREADS=20
aprun -n4 -N2 -S1 -d $OMP_NUM_THREADS ./a.out
 

node 0 socket 0 process#0 nprocs 4 thread id  0  nthreads 20  core id  0
node 0 socket 0 process#0 nprocs 4 thread id  1  nthreads 20  core id  1
....
node 0 socket 0 process#0 nprocs 4 thread id 19  nthreads 20  core id 19
node 0 socket 1 process#1 nprocs 4 thread id  0  nthreads 20  core id 20
...
node 0 socket 0 process#1 nprocs 4 thread id 19  nthreads 20  core id 39
....
node 1 socket 0 process#1 nprocs 4 thread id 19  nthreads 20  core id 39

 

How can I achieve the same or equivalent effect using Intel's mpirun?
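
A rough Intel MPI + OpenMP equivalent of that aprun line might look like this (the binary name is a placeholder; I_MPI_PIN_DOMAIN and -ppn are Intel MPI options, OMP_PLACES/OMP_PROC_BIND are standard OpenMP):

export OMP_NUM_THREADS=20
export I_MPI_PIN_DOMAIN=socket      # one pinning domain per socket, i.e. one rank per socket
export OMP_PLACES=cores             # each OpenMP thread gets its own core inside the domain
export OMP_PROC_BIND=close
mpirun -np 4 -ppn 2 ./a.out         # 4 ranks total, 2 ranks per node -> 1 rank per socket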
