Open Sourcing TonY: Native Support of TensorFlow on Hadoop
LinkedIn heavily relies on artificial intelligence to deliver content and create economic opportunities for its 575+ million members. Following recent rapid advances of deep learning technologies, our AI engineers have started adopting deep neural networks in LinkedIn’s relevance-driven products, including
In the beginning, our internal TensorFlow users ran the framework on small and unmanaged “bare metal” clusters. But we quickly realized the need to connect TensorFlow to the massive compute and storage power of our
Open sourcing TonY
To meet our needs, and because we know there are many others interested in running distributed machine learning who are also running large Hadoop deployments, we have built TensorFlow on YARN (TonY), which we are open sourcing today. Please check out the TonY project on GitHub for details on how to use it. Contributions and suggestions from the community are welcome!
In the rest of this blog post, we will cover the internal details of TonY, the features we have implemented and leveraged to scale distributed TensorFlow on Hadoop, and experimental results.
Existing solutions
In our initial investigation into running distributed TensorFlow on Hadoop, we found a few existing solutions. However, we ultimately determined that none met our particular requirements, leading to our decision to build TonY.
TensorFlow on Spark is an open source solution that enables you to run TensorFlow on the Apache Spark computing engine. We were able to onboard a couple of our internal deep learning applications on this framework, but ran into a few issues, most notably a lack of both GPU scheduling and heterogeneous container scheduling. Also, any scheduling and application lifecycle enhancements we wanted to make in the future would have to be done in Spark, which is much more difficult than making the change in a self-contained YARN application.
TensorFlowOnYARN is another open source solution that runs as a separate library. Unfortunately, fault tolerance support and usability in this project did not fit our needs. Furthermore, this project is no longer maintained.
For these reasons, we decided to build TonY to give us complete control over the resources in our Hadoop clusters. Also, since TonY runs directly on YARN as a lightweight dependency, we can easily evolve it alongside both the lower-level part of the stack in YARN and the higher-level part of the stack in TensorFlow.
How does TonY work?
Just as MapReduce provides the engine for running Pig/Hive scripts on Hadoop, and Spark provides the engine for running Scala code that uses Spark APIs, TonY aims to provide the same first-class support for running TensorFlow jobs on Hadoop by handling tasks such as resource negotiation and container environment setup.
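To make "resource negotiation and container environment setup" a bit more concrete, here is a minimal Python sketch of the kind of bookkeeping involved. It is not TonY's actual code: the `RESOURCE_REQUESTS` table, the function names, and the host names are all hypothetical. The only real convention it relies on is TensorFlow's `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry that tells each process its role in a distributed job.

```python
import json

# Hypothetical per-task-type resource requests. The names and values are
# illustrative only; they show heterogeneous containers (GPU workers,
# CPU-only parameter servers), not TonY's real configuration keys.
RESOURCE_REQUESTS = {
    "ps": {"instances": 1, "memory_mb": 4096, "vcores": 2, "gpus": 0},
    "worker": {"instances": 2, "memory_mb": 8192, "vcores": 4, "gpus": 1},
}


def tasks_to_request(requests):
    """Expand the request table into one (task_type, index, resources) per container."""
    tasks = []
    for task_type, res in sorted(requests.items()):
        per_container = {k: v for k, v in res.items() if k != "instances"}
        for index in range(res["instances"]):
            tasks.append((task_type, index, per_container))
    return tasks


def build_cluster_spec(allocations):
    """Group (task_type, index, host, port) allocations into a cluster spec.

    Sorting the tuples orders addresses by task type and then by task index,
    so position i in each list belongs to the task with index i.
    """
    spec = {}
    for task_type, _index, host, port in sorted(allocations):
        spec.setdefault(task_type, []).append(f"{host}:{port}")
    return spec


def container_env(cluster_spec, task_type, index):
    """Build the environment a single task's container would be launched with."""
    tf_config = {
        "cluster": cluster_spec,
        "task": {"type": task_type, "index": index},
    }
    return {"TF_CONFIG": json.dumps(tf_config)}
```

In this sketch, a scheduler would first call `tasks_to_request(RESOURCE_REQUESTS)` to decide which containers to ask for, then, once every container has reported a host and port, call `build_cluster_spec` once and `container_env` per task. Keeping the cluster spec as plain data makes it straightforward to rebuild and redistribute if a container has to be replaced.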