{"id":119,"date":"2025-11-03T11:57:38","date_gmt":"2025-11-03T09:57:38","guid":{"rendered":"https:\/\/sisu.ut.ee\/research-example\/?page_id=118"},"modified":"2025-11-10T16:35:06","modified_gmt":"2025-11-10T14:35:06","slug":"newt","status":"publish","type":"page","link":"https:\/\/sisu.ut.ee\/mc\/projects\/newt\/","title":{"rendered":"NEWT"},"content":{"rendered":"<p>A fault tolerant BSP framework on Hadoop YARN<\/p>\n\n\n\n<p>We looked for a distributed computing framework that could be utilized for running iterative scientific computing algorithms in the cloud, but most solutions proved lacking in some respect. This is why we decided to create our own with the following goals:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Provide automatic fault recovery.<\/li>\n\n\n\n<li>Retain the program state after fault recovery.<\/li>\n\n\n\n<li>Provide a convenient programming interface.<\/li>\n\n\n\n<li>Support (iterative) scientific computing applications.<\/li>\n<\/ul>\n\n\n\n<p>The following is an example of a linear system of equations solver, using the conjugate gradient method, implemented in NEWT:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>One has to register the program definition with the application master.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code has-light-background-color has-background\"><code>\npublic class CGTestMaster {\n\tpublic static void main(String[] args) {\n\t\t\/\/simply initialize the default configuration\n\t\tConfiguration conf = NEWTConfiguration.createDefault();\n\t\t\/\/the test data will be generated by the worker nodes\n\t\t\/\/so only use commandline arguments as input and \n\t\t\/\/specify the number of processes to request\n\t\tInput input = new CommandLineInput(conf, args, Integer.parseInt(args[0]));\n\t\t\/\/create an application master, using the configuration,\n\t\t\/\/the state class definition and the input\n\t\tFaultTolerantJobMaster am =\n\t\t\t\tnew FaultTolerantJobMaster(conf, CGState.class, input);\n\t\t\/\/register the application code with the master\n\t\tam.addStage(\"init\", new CG.Init());\n\t\tam.addStage(\"beginLoop\", new CG.BeginLoop());\n\t\tam.addStage(\"checkCondition\", new CG.CheckCondition());\n\t\tam.addStage(\"doMatVec\", new CG.DoMatVec());\n\t\t\/\/register our custom message type\n\t\tam.registerMessageType(DoubleArrayWritable.class);\n\t\t\/\/finally, run the application master\n\t\tam.run();\n\t}\n}<\/code><\/pre>\n\n\n\n<p>The\u00a0<strong>CGState.class<\/strong>\u00a0contains the definitions of all state variables used by the algorithm and instructions to serialize them to a bitstream. The objects that are given as arguments to\u00a0<strong>am.addStage<\/strong>\u00a0contain the code of the program, an example of these stages is:<\/p>\n\n\n\n<pre class=\"wp-block-code has-light-background-color has-background\"><code>\n\/\/verbose Java 1.6 \"closure\" definition\n\/\/can be made much more concise with Java 1.7\n\/\/or a Scala interface\npublic static class CheckCondition extends Stage {\n\t\/\/the method takes the BSPComm communicatior,\n\t\/\/the state object and the superstep number as arguments\n\t@Override\n\tpublic String execute(BSPComm bsp, CGState state, int superstep) {\n\t\t\/\/we know that on the previous superstep each process\n\t\t\/\/sent a message of 2 doubles, containing the partial\n\t\t\/\/dot product and local maximum norm of the residual\n\t\t\/\/vector (which is used as an error estimate)\n\t\tdouble[] recv = getFromAll(bsp, 2);\n\t\tstate.u += recv[0];\n\t\tstate.error = Math.max(state.error, recv[1]);\n\t\t\/\/depending on some state variables, determine what should be done next\n\t\tif (state.error &gt; TOLERANCE &amp;&amp; state.it &lt;= MAX_IT) {\n\t\t\tif (state.it == 0) {\n\t\t\t\tfor (int i = 0; i &lt; state.p.size(); i++) {\n\t\t\t\t\tstate.p.getData()[i] = state.z.getData()[i];\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\tdouble v = state.u\/state.ou;\n\t\t\t\tCGMath.scalevector(state.p.getData(), v);\n\t\t\t\tCGMath.addVector(state.p.getData(), state.z.getData());\n\t\t\t}\n\t\t\t\/\/we need to synchronize the vector p for matrix vector multiplication\n\t\t\tsendToNeighbors(bsp, state, state.p.getData());\n\t\t\t\/\/do matrix vector multiplication on the next superstep\n\t\t\treturn \"doMatVec\";\n\t\t} else {\n\t\t\t\/\/we're done, write result to disk and stop\n\t\t\treturn Stage.STAGE_END;\n\t\t}\n\t}\n}\n<\/code><\/pre>\n\n\n\n<p>Here the\u00a0<strong>getFromAll<\/strong>\u00a0and\u00a0<strong>sendToNeighbors<\/strong>\u00a0are implemented using the communicator\u2019s send function to transmit parts of some of the state from the\u00a0<strong>CGState<\/strong>\u00a0object. Since these are common communication patterns, these might eventually be implemented on the framework level. Some mathematical operations that need to be performed on the state variables are implemented in a separate class\u00a0<strong>CGMath<\/strong>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>To run the program, it has to be submitted to the YARN cluster:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code has-light-background-color has-background\"><code>public class TestJob {\n\tpublic static void main(String[] args) {\n\t\tNEWTClient client = new NEWTClient();\n\t\tclient.submit(CGTestMaster.class, args, true);\n\t\tclient.close();\n\t}\n}\n<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<p>While the framework is not in the shape where it can be used for any serious projects yet, it\u2019s open-source and available at:\u00a0<a href=\"https:\/\/bitbucket.org\/mobilecloudlab\/newt\">https:\/\/bitbucket.org\/mobilecloudlab\/newt<\/a>Categories<a href=\"https:\/\/mc.cs.ut.ee\/category\/projects\/\">Projects<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading has-regular-font-size\">Master theses <\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ilja\u00a0Kromonov, <a href=\"https:\/\/thesis.cs.ut.ee\/24293d04-c125-4cb9-ac7a-c03fe7597494\">Fault Tolerant Distributed Computing Framework for Scientific Algorithms<\/a><\/li>\n<\/ul>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A fault tolerant BSP framework on Hadoop YARN We looked for a distributed computing framework that could be utilized for running iterative scientific computing algorithms in the cloud, but most solutions proved lacking in some respect. This is why we &#8230;<\/p>\n","protected":false},"author":819,"featured_media":0,"parent":73,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"class_list":["post-119","page","type-page","status-publish","hentry"],"acf":[],"_links":{"self":[{"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/pages\/119","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/users\/819"}],"replies":[{"embeddable":true,"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/comments?post=119"}],"version-history":[{"count":1,"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/pages\/119\/revisions"}],"predecessor-version":[{"id":228,"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/pages\/119\/revisions\/228"}],"up":[{"embeddable":true,"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/pages\/73"}],"wp:attachment":[{"href":"https:\/\/sisu.ut.ee\/mc\/wp-json\/wp\/v2\/media?parent=119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}