[WDL] 8. Practice: Local/Cluster Execution

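WDL workflows can be run in three modes: locally, on a cluster, or in the cloud. Running locally needs no backend configuration file; the other two modes do.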

Local execution

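Download cromwell and womtool to the local server from https://github.com/broadinstitute/cromwell/releases
I don't recommend simply grabbing the latest release: version 78, the latest when I tried, failed with the error below. It is a Java version mismatch: the jar needs a newer Java runtime than the one I had installed.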

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/hsqldb/jdbcDriver has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0

I use version 51 as the example here.
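
The class file versions in the error above map to Java releases by a fixed offset: the major version minus 44 gives the release, so 55.0 is Java 11 and 52.0 is Java 8. A minimal shell sketch of that mapping follows; class_to_java is a name I made up for illustration, not a cromwell or JDK command.

```shell
#!/usr/bin/env bash
# Map a class file major version (as printed in UnsupportedClassVersionError)
# to the Java release that produces it: release = major - 44.
# class_to_java is a hypothetical helper name, not a cromwell or JDK command.
class_to_java() {
  local major="${1%%.*}"   # drop the minor part, e.g. 55.0 -> 55
  echo $(( major - 44 ))
}

class_to_java 55.0   # version the failing class was compiled for -> 11
class_to_java 52.0   # highest version the installed runtime accepts -> 8
```

Running java -version shows which runtime is installed. With Java 11 or later on the PATH, this particular error should go away, although I have only tested version 51.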

Example 1

Write echo.wdl:

workflow wf_echo {
  call echo
  output {
    echo.outFile
    echo.content
  }
}

task echo {
  String out
  command {
    echo Hello World! > ${out}
  }

  output {
    File outFile = "${out}"
    Array[String] content = read_lines(outFile)
  }
}

Validate the WDL with womtool:

java -jar womtool-51.jar validate echo.wdl

It prints Success!

Generate the inputs JSON:

java -jar womtool-51.jar inputs echo.wdl >echo.json

Edit echo.json to set the input value:

{
  "wf_echo.echo.out": "hello_world"
}

Run the WDL script with cromwell:

java -jar cromwell-51.jar run echo.wdl --inputs echo.json

Check whether the final status reported for the run is 'Succeeded' or 'Failed'.
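
Cromwell prints a lot of log output, so the status line is easy to miss. One way to check it is to save the log and search for it. The sketch below assumes the log quotes the status exactly as 'Succeeded' or 'Failed' after the word "status"; workflow_status and run.log are names I chose for illustration.

```shell
#!/usr/bin/env bash
# Report the final workflow status found in a saved cromwell log.
# Assumption: the log contains the status in single quotes, as in
# "... status 'Succeeded'"; adjust the pattern if your cromwell
# version words it differently.
workflow_status() {
  local log="$1"
  if grep -q "status 'Succeeded'" "$log"; then
    echo "Succeeded"
  elif grep -q "status 'Failed'" "$log"; then
    echo "Failed"
  else
    echo "Unknown"
  fi
}

# Typical use (commented out so the sketch runs without cromwell):
# java -jar cromwell-51.jar run echo.wdl --inputs echo.json 2>&1 | tee run.log
# workflow_status run.log
```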

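When the workflow finishes, two directories are created by default in the directory where it was launched: cromwell-executions, which holds the execution steps, and cromwell-workflow-logs, which holds the logs. The cromwell-executions directory is structured as follows: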

wf_echo/
└── d62e94fe-372d-434c-abcb-144036f26935
    └── call-echo
        ├── execution
        │   ├── hello_world
        │   ├── rc
        │   ├── script
        │   ├── script.background
        │   ├── script.submit
        │   ├── stderr
        │   ├── stderr.background
        │   ├── stdout
        │   └── stdout.background
        └── tmp.d25a3769

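Each run creates a new directory named with a unique ID string (earlier results are not overwritten), and every task gets a similar directory structure. In my experience it runs slowly (a lot gets invoked) and leaves too many intermediate files!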
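
With this many files per task, the rc file is usually the quickest thing to check, since it records the exit code of the task's command. The sketch below scans a run for non-zero codes. It assumes the layout shown in the tree above; failed_calls is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print every task whose return-code file is non-zero.
# Assumption: each call's execution/rc file holds the exit code of the
# task's command, and the layout matches the tree shown above.
# failed_calls is a hypothetical helper, not part of cromwell.
failed_calls() {
  local root="${1:-cromwell-executions}"
  find "$root" -type f -name rc | sort | while read -r rc_file; do
    code="$(tr -d '[:space:]' < "$rc_file")"
    if [ "$code" != "0" ]; then
      echo "$rc_file: exit code $code"
    fi
  done
}

# Typical use, from the directory where the workflow was launched:
# failed_calls cromwell-executions
```

If a task failed, the stderr file next to its rc file is the next place to look.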

The expected result:

$ cat hello_world
Hello World!

Example 2

A slightly more complex example, with parallel execution and multiple outputs. Take a look at its result directory.
test.wdl:

workflow testwdl {
     Int? thread = 6
     String varwdl
     String prefix
     Array[Int] intarray = [1,2,3,4,5]

     if(thread>5) {
        call taska {
            input:
            vara = varwdl,
            infile = taskb.outfile,
            prefix = prefix
        }
     }

     scatter (sample in intarray) {
          call taskb {
               input:
                    varb = 'testb',
                    thread = thread,
                    prefix = sample
          }
     }
}

task taska {
    String vara
    Array[File] infile
    String prefix

    command {
           cat ${sep=" " infile} >${prefix}_${vara}.txt
    }
}

task taskb {
    String varb
    Int thread
    String prefix

    command {
           echo ${varb} ${thread} >${prefix}.txt
    }

    output {
         File outfile = '${prefix}.txt'
    }
}

test.json:

{
  "testwdl.varwdl": "hellowdl",
  "testwdl.prefix": "testwdl"
}

Run it:

java -jar cromwell-51.jar run test.wdl --inputs test.json

The resulting directory structure:

23ab84c5-f219-4f2d-852f-677df6811a0b
├── call-taska
│   ├── execution
│   │   ├── rc
│   │   ├── script
│   │   ├── script.background
│   │   ├── script.submit
│   │   ├── stderr
│   │   ├── stderr.background
│   │   ├── stdout
│   │   ├── stdout.background
│   │   └── testwdl_hellowdl.txt
│   ├── inputs
│   │   ├── -1507720077
│   │   │   └── 3.txt
│   │   ├── 2086182641
│   │   │   └── 1.txt
│   │   ├── 289231282
│   │   │   └── 2.txt
│   │   ├── -806655499
│   │   │   └── 5.txt
│   │   └── 990295860
│   │       └── 4.txt
│   └── tmp.de320778
└── call-taskb
    ├── shard-0
    │   ├── execution
    │   │   ├── 1.txt
    │   │   ├── rc
    │   │   ├── script
    │   │   ├── script.background
    │   │   ├── script.submit
    │   │   ├── stderr
    │   │   ├── stderr.background
    │   │   ├── stdout
    │   │   └── stdout.background
    │   └── tmp.eba86162
    ├── shard-1
    │   ├── execution
    │   │   ├── 2.txt
    │   │   ├── rc
    │   │   ├── script
    │   │   ├── script.background
    │   │   ├── script.submit
    │   │   ├── stderr
    │   │   ├── stderr.background
    │   │   ├── stdout
    │   │   └── stdout.background
    │   └── tmp.658f2d2f
    ├── shard-2
    │   ├── execution
    │   │   ├── 3.txt
    │   │   ├── rc
    │   │   ├── script
    │   │   ├── script.background
    │   │   ├── script.submit
    │   │   ├── stderr
    │   │   ├── stderr.background
    │   │   ├── stdout
    │   │   └── stdout.background
    │   └── tmp.ae04eda0
    ├── shard-3
    │   ├── execution
    │   │   ├── 4.txt
    │   │   ├── rc
    │   │   ├── script
    │   │   ├── script.background
    │   │   ├── script.submit
    │   │   ├── stderr
    │   │   ├── stderr.background
    │   │   ├── stdout
    │   │   └── stdout.background
    │   └── tmp.bcfe9d45
    └── shard-4
        ├── execution
        │   ├── 5.txt
        │   ├── rc
        │   ├── script
        │   ├── script.background
        │   ├── script.submit
        │   ├── stderr
        │   ├── stderr.background
        │   ├── stdout
        │   └── stdout.background
        └── tmp.2e004f34
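
In the tree above, each element of the scatter gets its own shard-N directory under call-taskb, and taska sees the five outputs under its inputs directory. A minimal sketch for listing the shard outputs in shard order follows; shard_outputs is a hypothetical helper, and the run directory is passed in because its ID changes on every run.

```shell
#!/usr/bin/env bash
# List the .txt output of each scatter shard, in shard order.
# shard_outputs is a hypothetical helper; the call-<task>/shard-<N>/execution
# layout is taken from the directory tree shown above.
shard_outputs() {
  local run_dir="$1" task="$2"
  local i=0
  # Count upward instead of globbing so shard-10 sorts after shard-9.
  while [ -d "$run_dir/call-$task/shard-$i" ]; do
    ls "$run_dir/call-$task/shard-$i/execution/"*.txt
    i=$(( i + 1 ))
  done
}

# Typical use:
# shard_outputs cromwell-executions/testwdl/23ab84c5-f219-4f2d-852f-677df6811a0b taskb
```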

Cluster execution

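Cromwell can schedule tasks not only on the local machine but also through cluster and cloud job management systems; a small amount of configuration is enough to run large-scale computations.
The project provides example configuration files for various cluster and cloud job managers (https://github.com/broadinstitute/cromwell/tree/develop/cromwell.example.backends), but in essence they all embed the scheduler's commands in the configuration.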

SGE configuration: backend.conf

include required(classpath("application"))

backend {
  default = SGE
  # sge config
  providers {
    SGE {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {

        # Limits the number of concurrent jobs
        concurrent-job-limit = 50

        # Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        # exit-code-timeout-seconds = 120

        runtime-attributes = """
        Int cpu = 8
        Float? memory_gb
        String? sge_queue
        String? sge_project
        """

        submit = """
        qsub \
        -terse \
        -N ${job_name} \
        -wd ${cwd} \
        -o ${out}.out \
        -e ${err}.err \
        ${"-pe smp " + cpu} \
        ${"-l mem_free=" + memory_gb + "g"} \
        ${"-q " + sge_queue} \
        ${"-P " + sge_project} \
        ${script}
        """

        kill = "qdel ${job_id}"
        check-alive = "qstat -j ${job_id}"
        job-id-regex = "(\\d+)"

        # filesystem config
        filesystems {
          local {

            localization: [
               "hard-link","soft-link", "copy"
              ]

            caching {
              duplication-strategy: [
              "hard-link","soft-link",  "copy"
              ]

              # Default: "md5"
              hashing-strategy: "md5"

              # Default: 10485760 (10MB).
              fingerprint-size: 10485760

              # Default: false
              check-sibling-md5: false
            }
          }
        }
      }
    }
  }
}
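
In the submit template above, an expression such as ${"-q " + sge_queue} is meant to contribute nothing when the optional attribute is unset, so the flag appears only if the task sets it. The shell sketch below imitates that behaviour to show what the resulting qsub command line looks like. build_qsub is a hypothetical helper that only prints the command; the job name, paths, and queue are made-up values, and the -o/-e options are left out for brevity.

```shell
#!/usr/bin/env bash
# Imitate how the submit template builds the qsub command line.
# Assumption: an optional attribute that is unset contributes nothing,
# mirroring ${"-q " + sge_queue} in backend.conf. build_qsub is a
# hypothetical helper and does not call qsub itself.
build_qsub() {
  local job_name="$1" cwd="$2" script="$3" cpu="$4"
  local memory_gb="$5" sge_queue="$6" sge_project="$7"
  local cmd="qsub -terse -N $job_name -wd $cwd -pe smp $cpu"
  # ${var:+text} yields text only when var is set and non-empty.
  cmd="$cmd${memory_gb:+ -l mem_free=${memory_gb}g}"
  cmd="$cmd${sge_queue:+ -q $sge_queue}"
  cmd="$cmd${sge_project:+ -P $sge_project}"
  echo "$cmd $script"
}

# cpu is 8 by default in runtime-attributes; only the queue is set here.
build_qsub cromwell_echo /work/run /work/run/script 8 "" all.q ""
# prints: qsub -terse -N cromwell_echo -wd /work/run -pe smp 8 -q all.q /work/run/script
```

Because qsub is called with -terse, it prints only the job ID, which is what job-id-regex = "(\\d+)" captures for the later kill and check-alive commands.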

Submit command:
java -Dconfig.file=backend.conf -jar cromwell-51.jar run test.wdl --inputs test.json

If tasks use Docker, that needs configuring as well; for example:

dockerRoot=/cromwell-executions
backend {
  default = Docker

  providers {

    # Example backend that _only_ runs workflows that specify docker for every command.
    Docker {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        run-in-background = true
        runtime-attributes = "String docker"
        # Embed the docker run command
        # docker_cwd is set via dockerRoot (default /cromwell-executions) and corresponds to ./cromwell-executions under the current directory (${cwd})
        submit-docker = "docker run --rm -v ${cwd}:${docker_cwd} -i ${docker} /bin/bash < ${script}"
      }
    }
  }
}

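As for cloud configuration, the provider has usually set most of it up already; we only need to use its interface, and contact technical support if something does not work.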

Ref:
https://www.jianshu.com/p/b396f9fc15e9
https://www.jianshu.com/p/91a4d799bde5
https://zhuanlan.zhihu.com/p/417633670
https://github.com/broadinstitute/cromwell/blob/develop/cromwell.example.backends/cromwell.examples.conf