When trying to open a file with MPI_File_open using MPI_MODE_EXCL, an error should be returned if the file being created already exists. But is the file opened anyway, and a valid file handle returned?
thanks
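To make the question concrete, here is a minimal Fortran sketch of the pattern I am asking about (the file name 'exists.dat' and the program name are placeholders, not real code). My understanding is that the default error handler for files is MPI_ERRORS_RETURN, so MPI_File_open should hand back a non-success error code rather than aborting, and in that case I would not expect fh to be a usable handle:
```
program excl_test
  use mpi
  implicit none
  integer :: ierr, fh, rank
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  ! 'exists.dat' is a placeholder for a file that is already present
  call MPI_File_open(MPI_COMM_WORLD, 'exists.dat', &
                     MPI_MODE_CREATE + MPI_MODE_EXCL + MPI_MODE_WRONLY, &
                     MPI_INFO_NULL, fh, ierr)
  if (ierr /= MPI_SUCCESS) then
     ! Open failed; fh is not expected to be a usable handle here
     if (rank == 0) print *, 'MPI_File_open returned an error code:', ierr
  else
     call MPI_File_close(fh, ierr)
  end if
  call MPI_Finalize(ierr)
end program excl_test
```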
Hi everyone,
I am struggling to get Intel® Parallel Studio XE 2019 to run this simple hello_mpi.f90:
program hello_mpi
implicit none
include 'mpif.h'
integer :: rank, size, ierror, tag
integer :: status(MPI_STATUS_SIZE)
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
print*, 'node', rank, ': Hello world'
call MPI_FINALIZE(ierror)
end program hello_mpi
I tried first with Update 3 and now Update 4, and although the code compiles I get this error when trying to run it:
mpirun -np 4 ./hello_mpi
.../intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpirun: line 103: 13574 Floating point exceptionmpiexec.hydra "$@" 0<&0
Here is what gdb --args mpiexec.hydra -n 2 hostname returns:
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/uio/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec.hydra...done.
(gdb) run
Starting program: .../intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpiexec.hydra -n 2 hostname
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGFPE, Arithmetic exception.
IPL_MAX_CORE_per_package () at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:336
336 ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install intel-mpi-rt-2019.4-243-2019.4-243.x86_64
Can someone help?
Thanks in advance,
Jean
I am working with star ccm+ 2019.1.1 Build 14.02.012
CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is version shipped with star ccm+)
Cisco UCS cluster using USNIC fabric over 10gbe
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 cores
enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
On runs less than 5 hours, everything works flawlessly and is quite fast.
However when running with 280 cores at or around 5 hours into a job, the longer jobs die with the floating point exception.
The same job completes fine with 140 cores, but takes about 14 hours to finish.
Also, I am using PBS Pro with a 99-hour wall time.
------------------
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
error: Server Error
------------------
I have been doing some reading, and some say that other MPI implementations are more stable with Star CCM.
I have not ruled out that I am missing some parameters or tuning with Intel MPI as this is a new cluster.
I am also trying to make Open MPI work. I have Open MPI compiled and it runs, but only with a very small number of CPUs; with anything over about 2 cores per node it hangs indefinitely.
I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is the version that the Star CCM release I am running supports. I am telling Star to use the Open MPI that I installed so it can support the Cisco USNIC fabric, which I can verify using Cisco native tools. Note that Star also ships with its own Open MPI, however.
I am thinking that I need to tune Open MPI, which was also required with Intel MPI.
With Intel MPI, jobs with more than about 100 cores would hang until I added these parameters:
reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...
reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-a...
export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
After adding these parameters I can scale to 280 cores and it runs very fast, up until the point where it gets the floating point exception.
I am banging my head against a wall trying to find equivalent tuning parameters for Open MPI.
I have listed all the available Open MPI MCA parameters and have tried setting the following, with no success:
btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
Does anyone have any advice or ideas for:
1.) The floating point overflow issue
and
2.) Equivalent tuning parameters for Open MPI
Many thanks in advance
So I've been trying to unconfuse myself about the various fabrics/transports supported by Intel MPI 2018/19 as used for the `I_MPI_FABRICS` variable and related ones. I have a few questions I'm hoping someone can help with - they're all related, so I have put them in one thread:
1. Is there a way of getting intel mpi to output which transports/fabrics it thinks are available on a machine? Or is it just a question of trying each one with fallback disabled?
2. 2018 has a `tcp` fabric while 2019's `OFI` fabric has a `TCP` provider. Am I right in thinking these are *not* the same, with the former not using `libfabric` at all?
3. The `OFI` fabric in 2019 (only?) has an `RxM` provider. The other OFI providers seem tied to specific hardware, but I'm not clear whether this one is, or whether it is something more fundamental.
4. 2018 had an `ofa` fabric which is documented as supporting InfiniBand through OFED Verbs:
a) Am I right in thinking this is *not* the same verbs interface as is provided by the `ofi` fabric's `verbs` provider?
b) Does the OFA/OFED Verbs interface support anything other than InfiniBand?
Many thanks for any answers!
Dear all,
I am using Intel Parallel Studio XE 2017.6 in order to trace a hybrid OpenMP/MPI application.
I use:
```mpiexec.hydra -trace "libVT.so libmpi.so" python ....py args```
and although the application runs fine and an .stf file is created with reasonable results
the log file of my application's execution gives me the error:
ERROR: ld.so: object ''libVT.so' from LD_PRELOAD cannot be preloaded: ignored.
I would expect this error to be resolved by using:
export LD_PRELOAD=.../libVT.so
however it still persists.
If I remove "libVT.so libmpi.so" from the command above, I get:
ERROR: ld.so: object ''libVT.so' from LD_PRELOAD cannot be preloaded: ignored.
python: symbol lookup error: /rdsgpfs/general/apps/intel/2017.6/itac/2017.4.034/intel64/slib/libVT.so: undefined symbol: PMPI_Initialized
and my application terminates without success.
Does that mean that even though it complains about faulty preloading, it still uses the library? (I guess yes.)
Should I trust the results from the 'successful' but 'complaining' execution?
I will be more than happy to help with more info if needed.
Thank you in advance,
George Bisbas
Hi, I found that the performance difference is significant when running
I_MPI_DEBUG=1 mpirun -PSM2 -host node1 -n 1 ./IMB-MPI1 Sendrecv : -host node2 -n 1 ./IMB-MPI1
......
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
......
4194304 10 364.40 364.52 364.46 23012.87
and
I_MPI_DEBUG=1 mpirun -OFI -host node1 -n 1 ./IMB-MPI1 Sendrecv : -host node2 -n 1 ./IMB-MPI1
......
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
......
4194304 10 487.40 487.80 487.60 17196.66
Output of the latter seems to indicate that it uses psm2 backend too.
[0] MPID_nem_ofi_init(): used OFI provider: psm2
[0] MPID_nem_ofi_init(): max_buffered_send 64
[0] MPID_nem_ofi_init(): max_msg_size 64
[0] MPID_nem_ofi_init(): rcd switchover 32768
[0] MPID_nem_ofi_init(): cq entries count 8
[0] MPID_nem_ofi_init(): MPID_REQUEST_PREALLOC 128
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 2018 Update 1, MPI-1 part
#------------------------------------------------------------
# Date : Tue Sep 10 17:36:59 2019
# Machine : x86_64
# System : Linux
# Release : 4.4.175-89-default
# Version : #1 SMP Thu Feb 21 16:05:09 UTC 2019 (585633c)
# MPI Version : 3.1
# MPI Thread Environment:
.......
It gets even weirder when I run with I_MPI_FABRICS set; I get only an error.
I_MPI_DEBUG=1 I_MPI_FABRICS=shm,psm2 mpirun -host node81 -n 1 ./IMB-MPI1 Sendrecv : -host node82 -n 1 ./IMB-MPI1
[1] MPI startup: syntax error in intranode path of I_MPI_FABRICS = shm,psm2 and fallback is disabled, allowed value(s) shm,ofi,tmi,dapl,ofa,tcp
[0] MPI startup: syntax error in intranode path of I_MPI_FABRICS = shm,psm2 and fallback is disabled, allowed value(s) shm,ofi,tmi,dapl,ofa,tcp
Is the performance difference expected? If so, can I make mpirun default to using -PSM2 by changing environment variables or configuration files? (Aside from aliasing mpirun to "mpirun -PSM2", of course.)
Does Intel 19 cluster edition for Windows support the use of USE MPI_F08 in Fortran applications? When I try to compile code with this module on Windows with Intel 19.4 compilers, I get the following error:
error #7002: Error in opening the compiled module file. Check INCLUDE paths. [MPI_F08]
use mpi_f08
------------^
compilation aborted
I have:
mpifc.bat for the Intel(R) MPI Library 2019 Update 4 for Windows*
Copyright 2007-2019 Intel Corporation.
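For reference, here is a minimal mpi_f08 test of the kind that fails for me (a sketch I put together; the program name is arbitrary):
```
program f08_test
  use mpi_f08           ! this is the module the compiler cannot find
  implicit none
  integer :: rank
  type(MPI_Comm) :: comm
  call MPI_Init()       ! in mpi_f08 the ierror argument is optional
  comm = MPI_COMM_WORLD
  call MPI_Comm_rank(comm, rank)
  print *, 'rank', rank
  call MPI_Finalize()
end program f08_test
```
Compiling this with mpifc.bat reports error #7002 at the `use mpi_f08` line.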
Hi All,
I have two Dell R815 servers, each with 4 AMD Opteron 6380 CPUs (16 cores each), connected directly by two InfiniBand cards. I have trouble running the IMB-MPI1 test even on a single node:
mpirun -n 2 -genv I_MPI_DEBUG=3 -genv I_MPI_FABRICS=ofi /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1
The run aborted with the following error:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 2.23 0.00
1 1000 2.24 0.45
2 1000 2.25 0.89
4 1000 2.26 1.77
8 1000 2.24 3.57
16 1000 2.25 7.12
32 1000 2.27 14.08
64 1000 2.43 26.33
128 1000 2.55 50.26
256 1000 3.60 71.08
512 1000 4.12 124.40
1024 1000 5.04 203.00
2048 1000 6.89 297.38
4096 1000 10.56 387.76
8192 1000 13.98 585.83
16384 1000 22.74 720.65
32768 1000 30.12 1087.81
65536 640 46.17 1419.45
131072 320 76.43 1714.87
262144 160 334.23 784.32
524288 80 511.22 1025.57
1048576 40 850.76 1232.51
2097152 20 1518.37 1381.19
Abort(941742351) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(155)............: MPI_Send(buf=0x3a100f0, count=4194304, MPI_BYTE, dest=1, tag=1, comm=0x84000003) failed
MPID_Send(572)............:
MPIDI_send_unsafe(203)....:
MPIDI_OFI_send_normal(414):
(unknown)(): Other MPI error
However, it runs fine with shm:
mpirun -n 2 -genv I_MPI_DEBUG=3 -genv I_MPI_FABRICS=shm /opt/intel/impi/2019.5.281/intel64/bin/IMB-MPI1
Trying to run with 2 CPUs on two different nodes also fails at the 4M message size.
I have been struggling with this for a few days now without success. Any suggestions on where to look or what to try?
Thanks!
Qi
Following the directions at
https://software.intel.com/en-us/articles/installing-intel-free-libs-and...
I downloaded the public key from
https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS...
previously and kept a copy. Recently my use of that key started to fail and so I downloaded the key again and noticed that it now has two values in it, which is odd.
I tried using the public key that has two keys in it, and it mostly works -- in some of the GCP zones this works fine, but in others it doesn't. In one zone where it doesn't, I did an "apt-key list" and noticed that one of the keys was expired:
pub 2048R/7E6C5DBE 2019-09-30 [expires: 2023-09-30]
uid Intel(R) Software Development Products
pub 2048R/1911E097 2016-09-28 [expired: 2019-09-27]
uid "CN = Intel(R) Software Development Products", O=Intel Corporation
I wasn't able to do an "apt-key del", so I started over with just the new key. It shows up with apt-key list as okay:
pub 2048R/7E6C5DBE 2019-09-30 [expires: 2023-09-30]
uid Intel(R) Software Development Products
but I can't do an update:
$ sudo apt-get update
W: GPG error: https://apt.repos.intel.com/intelpython binary/ InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 1A8497B11911E097
W: The repository 'https://apt.repos.intel.com/intelpython binary/ InRelease' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: GPG error: https://apt.repos.intel.com/mkl all InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 1A8497B11911E097
W: The repository 'https://apt.repos.intel.com/mkl all InRelease' is not signed.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: GPG error: https://apt.repos.intel.com/ipp all InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 1A8497B11911E097
By the way I tried to email otc-digital-experiences@intel.com as mentioned on https://software.intel.com/en-us/faq and it came back as undeliverable.
Hi!
Does anyone know if it is possible to run Intel MPI with Mellanox InfiniBand (ConnectX-5 or 6) cards running Mellanox's latest WinOF-2 v2.2 in a Windows 10 environment? I've been googling and reading for hours but I can't find any concrete information.
This is for running Ansys CFX/Fluent on a relatively small CFD cluster of 4 compute nodes. The current release of CFX/Fluent (2019 R3) runs on Intel MPI 2018 Release 3 by default.
Older versions of Intel MPI (2017, for example) listed specifically "Windows* OpenFabrics* (WinOF*) 2.0 or higher" and "Mellanox* WinOF* Rev 4.40 or higher" as supported InfiniBand software. “Windows OpenFabrics (WinOF)” appears to be dead and does not support Windows 10. The older Mellanox WinOF Rev 4.40 does not support the newest Mellanox IB cards.
The release notes for Intel MPI 2018 and newer do not mention this older InfiniBand software, and instead mention Intel Omni-Path.
Mellanox's own release notes for WinOF-2 v2.2 only mention Microsoft MS MPI for the MPI protocol. ANSYS does run on MS MPI, but then I think I would have to move the cluster over to a Windows Server OS environment. I currently run the cluster successfully on Windows 10 using Intel MPI, but over 10GigE and not InfiniBand.
Thanks for any pointers!
Cheers.
I'm using MPI_FILE_WRITE_SHARED to write some error output to a single file. NFS is used so that all nodes write/read the same files. When I run the program on any single node with multiple processes, error output occurs correctly. However, when I run the code across multiple nodes, nothing gets written to the file. Here's a simple test program:
Program MPIwriteTest
   use mpi
   implicit none
   integer mpiFHerr, mpiErr, myRank
   character (len=80) string
   character(len=2), parameter:: CRLF = char(13)//char(10)
   ! Initialize MPI and get rank
   call MPI_INIT( mpierr )
   call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, mpierr)
   ! open and close file MPIerror.dat to delete any existing file
   call MPI_FILE_OPEN(MPI_COMM_WORLD, 'MPIerror.dat', &
        MPI_MODE_WRONLY+MPI_MODE_CREATE+MPI_MODE_SEQUENTIAL+MPI_MODE_DELETE_ON_CLOSE, &
        MPI_INFO_NULL, mpiFHerr, mpiErr)
   call MPI_FILE_CLOSE(mpiFHerr, mpiErr) ! This will delete the file.
   ! open but don't delete on close
   call MPI_FILE_OPEN(MPI_COMM_WORLD, 'MPIerror.dat', &
        MPI_MODE_WRONLY+MPI_MODE_CREATE+MPI_MODE_SEQUENTIAL, &
        MPI_INFO_NULL, mpiFHerr, mpiErr)
   ! test code just does a simple write
   write(string,'(a,i0)') 'Error from process: ', myRank
   call MPI_FILE_WRITE_SHARED(mpiFHerr, trim(string)//CRLF, len_trim(string)+2, &
        MPI_CHARACTER, MPI_STATUS_IGNORE, mpiErr)
   ! close and end
   call MPI_FILE_CLOSE(mpiFHerr, mpiErr)
   call MPI_FINALIZE(mpierr)
end program MPIwriteTest
I've also noticed that if the file already exists (and I don't do the open and delete_on_close), then the file contains text, but sometimes the file is corrupt. Is there something wrong in this code? Is MPI not playing well with NFS?
BTW, I'm using Parallel Studio XE 2019 Update 4 Cluster Edition.
thanks, -joe
I am trying to run the simple hello world code in Fortran using the Intel MPI library, but all processes report the same rank, as if the program does not run on more than one core. I was following the troubleshooting procedures provided by Intel (Point 2 - https://software.intel.com/en-us/mpi-developer-guide-windows-troubleshoo...), and I got this:
C:\Program Files (x86)\IntelSWTools>mpiexec -ppn 1 -n 2 -hosts node01,node02 hostname
[mpiexec@Sebastian-PC] HYD_sock_connect (..\windows\src\hydra_sock.c:216): getaddrinfo returned error 11001
[mpiexec@Sebastian-PC] HYD_connect_to_service (bstrap\service\service_launch.c:76): unable to connect to service at node01:8680
[mpiexec@Sebastian-PC] HYDI_bstrap_service_launch (bstrap\service\service_launch.c:416): unable to connect to hydra service
[mpiexec@Sebastian-PC] launch_bstrap_proxies (bstrap\src\intel\i_hydra_bstrap.c:525): error launching bstrap proxy
[mpiexec@Sebastian-PC] HYD_bstrap_setup (bstrap\src\intel\i_hydra_bstrap.c:714): unable to launch bstrap proxy
[mpiexec@Sebastian-PC] wmain (mpiexec.c:1919): error setting up the boostrap proxies
Any ideas how to fix it? Any help would be appreciated.
Argh, please fix on next update.
I am able to compile a hello_world.c program with mpiicc but am unable to get it to run. It works for me with Intel 2014 - 2018, but not with 2019.5.
Debugging output:
===================================================================================
hjohnson@tuxfast:/tmp/intelmpi-2019$ env I_MPI_DEBUG=6 I_MPI_HYDRA_DEBUG=on mpirun -np 1 ./a.out
[mpiexec@tuxfast] Launch arguments: /project/software/intel_psxe/2019_update1/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host tuxfast --upstream-port 33709 --pgid 0 --launcher ssh --launcher-number 0 --base-path /project/software/intel_psxe/2019_update1/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /project/software/intel_psxe/2019_update1/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 30410 RUNNING AT tuxfast
= KILLED BY SIGNAL: 4 (Illegal instruction)
===================================================================================
hjohnson@tuxfast:/tmp/intelmpi-2019$ env I_MPI_DEBUG=6 I_MPI_HYDRA_DEBUG=on gdb ./a.out
...
(gdb) run
Starting program: /tmp/intelmpi-2019/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGILL, Illegal instruction.
MPL_dbg_pre_init (argc_p=0x0, argv_p=0x0, wtimeNotReady=61440) at ../../../../src/mpl/src/dbg/mpl_dbg.c:722
722 ../../../../src/mpl/src/dbg/mpl_dbg.c: No such file or directory.
(gdb) backtrace
#0 MPL_dbg_pre_init (argc_p=0x0, argv_p=0x0, wtimeNotReady=61440) at ../../../../src/mpl/src/dbg/mpl_dbg.c:722
#1 0x00001555545850fe in PMPI_Init (argc=0x0, argv=0x0) at ../../src/mpi/init/init.c:225
#2 0x0000000000400ec3 in main ()
(gdb) quit
A debugging session is active.
Dear all,
I am currently using the MPI distributed graph topologies and I allow rank reordering (https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node195.htm#Node195).
However, after some small tests, I noticed that Intel MPI (2019) does not reorder my ranks.
Since allowing reordering increases the complexity of the code, I would like to be sure that it will actually be useful in some cases.
Does Intel MPI reorder the ranks for MPI topologies? If yes, what are the requirements (machine files etc...)?
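To illustrate what I tested (a simplified sketch with a toy ring topology, not my actual application code), I create a distributed graph communicator with reorder set to .true. and compare the rank before and after:
```
program graph_reorder
  use mpi
  implicit none
  integer :: ierr, old_rank, new_rank, nprocs, graph_comm
  integer :: sources(1), degrees(1), destinations(1), weights(1)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, old_rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  ! Toy topology: each rank declares one edge to its right neighbour (a ring)
  sources(1)      = old_rank
  degrees(1)      = 1
  destinations(1) = mod(old_rank + 1, nprocs)
  weights(1)      = 1
  ! reorder = .true. allows the library to assign different ranks in graph_comm
  call MPI_Dist_graph_create(MPI_COMM_WORLD, 1, sources, degrees, destinations, &
                             weights, MPI_INFO_NULL, .true., graph_comm, ierr)
  call MPI_Comm_rank(graph_comm, new_rank, ierr)
  if (old_rank /= new_rank) print *, 'rank', old_rank, 'was reordered to', new_rank
  call MPI_Comm_free(graph_comm, ierr)
  call MPI_Finalize(ierr)
end program graph_reorder
```
In my small tests the new rank always matched the old one, which is what prompted the question.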
Thank you very much for your help!
Hi all
Has anybody had much luck/experience with mpitune under 2019? It feels like a **lot** more work than the equivalent sort of activities under previous versions, so I'm wondering if I'm making it more difficult than it needs to be.
Specific 'challenges':
1. 'msg_size' is not considered in the example/supplied tuning configurations. I feel it should be, as that in particular affects the algorithm choice for optimal performance (e.g. ALLTOALLV switches optimal algorithm at message size of 1KB). I guess I can manually fiddle with the mpitune generated JSON configuration, but that doesn't feel quite right, unless I'm just being lazy.
2. I'm not quite sure I understand the supplied and downloadable (https://software.intel.com/en-us/articles/replacing-tuning-configuration...) tuning files, perhaps due to a lack of documentation surrounding them. Are they used by default in any way, or only when specified?
3. 'Autotuning' - linked to the above - is this intended to be used 'live' and repeatedly or should I be using it once and then capturing something from it? Lots of config variables surrounding this.
Perhaps I'm missing some key document or reading or understanding, but any comments or thoughts would be appreciated.
~~
A
We observed a peculiar behavior of Intel MPI with the exit status emitted by the STOP statement.
PROGRAM hello
   USE mpi
   IMPLICIT NONE
   INTEGER :: rank, size, ierror
   CALL MPI_INIT(ierror)
   CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
   CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
   PRINT *, 'rank', rank, ': Hello, World!'
   CALL MPI_FINALIZE(ierror)
   STOP 2
END
The exit status is used for quick debugging purposes in our codes.
With Intel MPI 2019 Update 4, we received bad termination errors, for instance using 2 MPI ranks:
rank 1 : Hello, World!
rank 0 : Hello, World!
2
2
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 167623 RUNNING AT login03
= EXIT STATUS: 2
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 167624 RUNNING AT login03
= EXIT STATUS: 2
===================================================================================
This behavior was not observed with previous versions of the library. We are not sure whether this is a bug in IMPI or an intended feature.
Ideally, we are looking for a way to suppress this bad termination error.
Thanks.
I run a 16 node 256 core Dell cluster running on Redhat Enterprise Linux.
Our primary use is to run the engineering software LSTC LS-Dyna. With a recent change in LSTC licensing, the newest versions of the software we want to run will now only run using Intel MPI (previously we used Platform MPI).
However, I cannot seem to get the PBS job submission script that used to work with Platform MPI to work with Intel MPI.
The submission script reads as follows (with the last line being the submission line for the LS-Dyna job testjob.k):
#!/bin/bash
#PBS -l select=8:ncpus=16:mpiprocs=16
#PBS -j oe
cd $PBS_JOBDIR
echo "starting dyna .. "
machines=$(sort -u $PBS_NODEFILE)
ml=""
for m in $machines
do
nproc=$(grep $m $PBS_NODEFILE | wc -l)
sm=$(echo $m | cut -d'.' -f1)
if [ "$ml" == "" ]
then
ml=$sm:$nproc
else
ml=$ml:$sm:$nproc
fi
done
echo Machine line: $ml
echo PBS_O_WORKDIR=$PBS_O_WORKDIR
echo "Current directory is:"
pwd
echo "machines"
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -machines $ml /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp
When I attempt to run this job via the PBS job manager and look into the standard error file, I see:
[mpiexec@gpunode03.hpc.internal] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@gpunode03.hpc.internal] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
[mpiexec@gpunode03.hpc.internal] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:176): unable to send signal downstream
[mpiexec@gpunode03.hpc.internal] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@gpunode03.hpc.internal] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@gpunode03.hpc.internal] main (../../ui/mpich/mpiexec.c:1157): process manager error waiting for completion
I know I can submit a job manually (no PBS involved) and it will run fine on a node of the cluster using Intel MPI.
So I have boiled the issue down to the part of the submission line that says -machines $ml, which handles the node allocation.
For some reason Intel MPI does not accept this syntax, whereas Platform MPI did?
I am quite stumped here and any advice would be greatly appreciated.
Thanks.
Richard.
All,
This is both a question and a feature request.
First the question. What is the "correct" way for Intel MPI to support use of GCC 9? I know (as of Intel MPI 19.0.2, the latest I have access to) that it has support for GCC 4-8. I also know there is a binding kit for Intel MPI that could make bindings for Intel MPI with GCC 9. But, as far as I can tell, the mpif90/mpifc scripts have no idea GCC 9 exists. Does one need to actually edit these scripts to add a 9) case so that the scripts can find the appropriate include/ directories?
Second, the "feature request" but I'm not sure where one makes those. Namely, I was wondering if it's possible for Intel to add support for use mpi_f08 with GCC compilers. At the moment, the included bindings (and the kit) only have the F90 modules and not the F08 modules. I can see not supporting them with really old GCC, but I'm fairly certain I've built Open MPI with GCC 8 and it makes the mpi_f08 mod files just fine. I tried looking at the binding kit, but it only compiles the f90 modules as well it seems.
Thanks,
Matt
Hi All,
As per the recent webinar introducing new Intel MPI 2019 Update 5 features, it is now in theory possible to include the Intel MPI libraries, and call mpirun for a multi-node MPI job, entirely inside a Singularity container, with no need to have Intel MPI installed outside the container. So instead of launching an MPI job in a container using an external MPI stack, like so:
mpirun -n <nprocs> -perhost <procs_per_node> -hosts <hostlist> singularity exec <container_name> <path_to_executable_inside_container>
one should now be able to do:
singularity exec <container_name> mpirun -n <nprocs> -perhost <procs_per_node> -hosts <hostlist> <path_to_executable_inside_container>
I have the Intel MPI 2019.5 libraries (as well as Intel run-time libraries for C++), plus libfabric, inside my container, along with sourcing the following in the container:
cat /.singularity.d/env/90-environment.sh
#!/bin/sh
# Custom environment shell code should follow
source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/impi/2019.5.281/intel64/bin/mpivars.sh -ofi_internal=1 release
This is not working so far. Below I illustrate with a simple test run from inside the container (shell mode); after the command hangs with no output for about 20-30 seconds, I get the following error messages:
Singularity image.sif:~/singularity/fv3-upp-apps> export I_MPI_DEBUG=500
Singularity image.sif:~/singularity/fv3-upp-apps> export FI_PROVIDER=verbs
Singularity image.sif:~/singularity/fv3-upp-apps> export FI_VERBS_IFACE="ib0"
Singularity image.sif:~/singularity/fv3-upp-apps> export I_MPI_FABRICS=shm:ofi
Singularity image.sif:~/singularity/fv3-upp-apps> mpirun -n 78 -perhost 20 -hosts appro07,appro08,appro09,appro10 hostname
[mpiexec@appro07.internal.redlineperf.com] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:114): unable to run proxy on appro07 (pid 109898)
[mpiexec@appro07.internal.redlineperf.com] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:152): check exit codes error
[mpiexec@appro07.internal.redlineperf.com] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:205): poll for event error
[mpiexec@appro07.internal.redlineperf.com] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:731): error waiting for event
[mpiexec@appro07.internal.redlineperf.com] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1919): error setting up the boostrap proxies
I also tried just calling mpirun using just one host (and only enough processes that fit on one host), with the same result.
Is there a specific list of dependencies (e.g. do I need openssh-clients installed?) for this all-inside-the-container approach? I do not see anything in the Intel MPI 2019 Update 5 Developer Reference about running with Singularity containers.
Thanks, Keith