MXNet Build and Compilation Errors

MXNet is a machine learning / deep learning framework. Compared to peers such as TensorFlow, it is less widely used.

Indeed, my last search on Safari Books Online yielded only publications that mention MXNet by name, with no concrete examples of how to build or use it.

In this post I aim to show some of the common errors I encountered while building it manually. In future posts, I will demonstrate how I build it from scratch using multi-stage builds.

The errors are listed below.

Please note that this is not meant to be a comprehensive list. As usual, a different system setup or different requirements may produce errors that differ from mine.

Compilation Failures

In my earlier attempts at compilation, the process would fail suddenly with errors such as:

...
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
make[2]: *** [CMakeFiles/mxnet_static.dir/src/operator/tensor/indexing_op.cc.o] Error 4
make[2]: *** Waiting for unfinished jobs....
...

This is related to a memory leak issue in gcc 7.5.

Assuming you have gcc 8 installed, you can switch to it by setting the following environment variables before compiling:

export CC="gcc-8" && \
export CXX="g++-8"
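
Before exporting the variables, it can help to confirm that both compilers are actually on the PATH. The following is only a minimal sketch and not part of my original build: the function name select_compilers and the fallback to plain gcc / g++ are assumptions made for illustration.

```shell
# Prefer gcc-8/g++-8 when both are installed; otherwise fall back to the
# default gcc/g++ (hypothetical fallback, adjust to your own toolchain).
select_compilers() {
    if command -v gcc-8 >/dev/null 2>&1 && command -v g++-8 >/dev/null 2>&1; then
        CC="gcc-8"
        CXX="g++-8"
    else
        echo "gcc-8/g++-8 not found, falling back to gcc/g++" >&2
        CC="gcc"
        CXX="g++"
    fi
    export CC CXX
}

select_compilers
echo "CC=${CC} CXX=${CXX}"
```

CMake reads CC and CXX only when it first configures a build directory, so run this in the same shell before configuring a fresh build folder.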

The examples show compilation with the ninja build tool. However, my attempts at using it were unsuccessful. At least on my local machine, ninja tends to consume more CPU and memory than it should. With the cmake build tool and its make backend, I have more control over the number of parallel processes, and the build logs are shown in the console.
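
A "Killed" compiler process usually means the machine ran out of memory, so capping the number of parallel jobs is one way to use that control. The sketch below estimates a job count from available memory. The figure of roughly 2 GB per compiler process is my own rule-of-thumb assumption, and safe_jobs is a hypothetical helper name; tune both for your machine.

```shell
# Estimate a safe number of parallel compile jobs.
# ASSUMPTION: about 2 GB of RAM per compiler process (a rule of thumb).
# Arguments: available memory in kB, number of CPU cores.
safe_jobs() {
    local mem_kb="$1"
    local cores="$2"
    local per_job_kb=$((2 * 1024 * 1024))
    local n=$((mem_kb / per_job_kb))
    if [ "$n" -lt 1 ]; then n=1; fi
    if [ "$n" -gt "$cores" ]; then n="$cores"; fi
    echo "$n"
}

# Read available memory from /proc/meminfo (Linux only); default to 2 GB.
mem_kb="$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null || true)"
mem_kb="${mem_kb:-2097152}"
cores="$(nproc 2>/dev/null || echo 1)"

# Print the suggested command rather than running it.
echo "Suggested build command: cmake --build . -- -j$(safe_jobs "$mem_kb" "$cores")"
```

The result is capped at the core count and never drops below one job, so the command stays valid even on a very small machine.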

ONNX Issues

If compilation fails with a "namespace not found" error while the ONNX option is enabled, set the following environment variable:

export ONNX_NAMESPACE=onnx

If compilation succeeds but loading the framework fails with File already exists: ... onnx-ml.proto, the cause is the protobuf library having been built as a shared library. Because onnx-ml.proto is also linked in by other libraries, loading it raises an error.

The only solution I found is to remove all previous installs of protobuf and build it again manually. The following works for me:

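# Build protobuf as a static library, then copy the install tree into /usr/local.
# Assumption: the contents of /protobufbuild are meant to be merged into
# /usr/local, hence cp -r with a trailing "/." on the source.
git clone --recursive -b 3.5.1.1 https://github.com/google/protobuf.git && \
    cd protobuf && \
    ./autogen.sh && \
    ./configure --disable-shared CXXFLAGS=-fPIC --prefix=/protobufbuild && \
    make -j4 && \
    make install && \
    ldconfig && \
    cd / && \
    cp -r /protobufbuild/. /usr/local/ && \
    rm -rf protobuf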

Note the --disable-shared option which builds it as a static library.
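
To check whether an older shared build of protobuf is still lying around, something like the following sketch may help. The helper name find_shared_protobuf and the two directories searched at the end are assumptions for illustration; your distribution may keep libraries elsewhere.

```shell
# List shared protobuf libraries found under the given directories.
find_shared_protobuf() {
    local dir
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        find "$dir" -maxdepth 2 \
            \( -name 'libprotobuf*.so*' -o -name 'libprotoc*.so*' \) 2>/dev/null
    done
}

# Assumed search locations; extend the list to match your system.
leftovers="$(find_shared_protobuf /usr/lib /usr/local/lib)"
if [ -n "$leftovers" ]; then
    echo "Shared protobuf libraries still present:"
    echo "$leftovers"
else
    echo "No shared protobuf libraries found."
fi
```

Static archives (libprotobuf.a) are deliberately not matched, since those are what the --disable-shared build produces.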

To run ONNX in Python, we need to install the onnx PyPI package. A version I found to work across all MXNet versions from v1.6.x to v1.8.x is onnx==1.3.0.

MKLDNN Issues

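MKLDNN (Intel MKL-DNN, since renamed oneDNN) is Intel's library of optimized deep learning primitives for running ML operations on Intel-compatible CPUs. It makes running on the CPU a workable alternative when NVIDIA GPUs are not available. More information is available on the Intel MKLDNN project page.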

The errors below applied only when I was trying to compile a Python wheel of MXNet. They may not apply to your use case.

While compiling the Python wheel, I had to enable TVM, and the build threw the following error:

Traceback (most recent call last):
  File "/mxnet/contrib/tvmop/compile.py", line 20, in <module>
    import tvm
  File "/mxnet/3rdparty/tvm/python/tvm/__init__.py", line 36, in <module>
    from . import target
  File "/mxnet/3rdparty/tvm/python/tvm/target.py", line 70, in <module>
    raise err_msg
  File "/mxnet/3rdparty/tvm/python/tvm/target.py", line 66, in <module>
    from decorator import decorate
ModuleNotFoundError: No module named 'decorator'

Ensure that the decorator==4.4.2 PyPI package is installed before building the third-party plugins.
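
A quick pre-build check can catch the missing module before the long compile starts. This is a minimal sketch; has_py_module is a hypothetical helper name, and it assumes the build uses the python3 found on the PATH.

```shell
# Return 0 if the module imports cleanly under python3, non-zero otherwise
# (including when python3 itself is not installed).
has_py_module() {
    command -v python3 >/dev/null 2>&1 || return 2
    python3 -c "import $1" >/dev/null 2>&1
}

if has_py_module decorator; then
    echo "decorator: found"
else
    echo "decorator: missing, install it first (pip install decorator==4.4.2)"
fi
```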

Another issue I encountered was MXNet not being able to locate or load the TVM config file.

After building MXNet, copy the generated tvmop.conf file from the build folder into /usr/local/lib/<python version>/lib:

mkdir -p /usr/local/lib/python3.6/lib && \
cp tvmop.conf /usr/local/lib/python3.6/lib/
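
If you build against more than one Python version, the destination path can be derived instead of hard-coded. The sketch below is a dry run that only prints the commands. The helper name tvmop_dest is hypothetical, and it assumes the same /usr/local/lib/python<version>/lib layout as above.

```shell
# Build the tvmop.conf destination directory from a Python version string,
# following the /usr/local/lib/python<version>/lib layout used in this post.
tvmop_dest() {
    echo "/usr/local/lib/python$1/lib"
}

# Detect the interpreter's major.minor version; fall back to 3.6
# (the version used in this post) if python3 is unavailable.
py_ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || echo 3.6)"
dest="$(tvmop_dest "$py_ver")"

# Dry run: print the commands instead of executing them.
echo "mkdir -p ${dest}"
echo "cp tvmop.conf ${dest}/"
```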

CUDA versions

Another point to note is that MXNet v1.8.x and above uses CUDA 10.2 and above. If you are building multiple versions of MXNet in Docker, your Dockerfile needs to take the different required CUDA versions into account, otherwise compilation will fail.
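
One way to guard against a mismatch is to check the version pair before starting a build. This sketch encodes only the single rule stated above (v1.8.x and above needs CUDA 10.2 or later) and deliberately treats older MXNet versions as unconstrained, which is an assumption. The helper names are hypothetical, the version pairs in the loop are illustrative, and the comparison relies on sort -V from GNU coreutils.

```shell
# True when version $1 >= version $2, using version-aware ordering (sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# True when the CUDA version is acceptable for the given MXNet version.
# Only the v1.8.x rule from this post is encoded; older versions pass as-is.
cuda_ok() {
    local mxnet="$1"
    local cuda="$2"
    if version_ge "$mxnet" "1.8.0"; then
        version_ge "$cuda" "10.2"
    else
        return 0
    fi
}

# Illustrative MXNet/CUDA pairs, not taken from a real build matrix.
for pair in "1.6.0 10.1" "1.8.0 10.1" "1.8.0 10.2"; do
    set -- $pair
    if cuda_ok "$1" "$2"; then
        echo "MXNet $1 with CUDA $2: ok"
    else
        echo "MXNet $1 with CUDA $2: CUDA too old"
    fi
done
```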

Conclusion

In conclusion, it takes some effort to get MXNet to compile from source but you will learn a lot about the framework just by doing it - I certainly did.

Happy Hacking.