Cascading’s Logparser example in Clojure

I’ve long been interested in working with Cascading but didn’t relish the thought of jumping back into Java. Thankfully with the arrival of Clojure I can now happily play in my less-typing world.

I wanted to get a basic Cascading demo running so I took the Logparser example from the distribution and ported it over to Clojure. Here’s a zip (casclojure.zip) with the Ant build.xml, the Clojure source, the directory layout, etc. I’ll leave it up to you to get the required jars & libs–I’m using Hadoop 0.18.3 and Cascading 1.0.11. The build.xml has the usual config stuff for setting up where you keep your jars. I gave up trying to be smart about managing jars so I now dump everything into a single dir. Sure makes things like Ant config easier….

Incidentally, this is a perfect example of standing on the shoulders of genius and/or the sufficiently motivated. Most of the build.xml I picked up from Kyle Burton’s “Creating Executable Jars For Your Clojure Application”. (He has some other interesting Clojure posts talking about Incanter & Quartz scheduler btw.)

Clojure is concise. This example isn’t particularly idiomatic Clojure however; you wouldn’t use CamelCase for instance. But it does the job. (Also, gross, definitely need a syntax highlighter now.)

(ns logparser.app
  (:gen-class)
  (:import
     (java.util Properties)
     (cascading.flow Flow FlowConnector)
     (cascading.operation.regex RegexParser)
     (cascading.pipe Each Pipe)
     (cascading.scheme TextLine)
     (cascading.tap Hfs Lfs Tap)
     (cascading.tuple Fields)
     (org.apache.log4j Logger)))
(def apacheFields (new Fields (into-array ["ip" "time" "method" "event" "status" "size"])))
(def apacheRegex
  "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$")
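;; Regex groups 1-6 map, in order, onto the six fields declared above.
(def allGroups (int-array [1 2 3 4 5 6]))
(def parser (new RegexParser apacheFields apacheRegex allGroups))
;; Apply the parser to the "line" field of every tuple read from the source tap.
(def importPipe (new Each "parser" (new Fields (into-array ["line"])) parser))
(def properties (new Properties))
;; Tell Hadoop which jar holds the job by pointing it at the class :gen-class generates.
(FlowConnector/setApplicationJarClass properties logparser.app)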
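(defn -main [& args]
  ;; First argument: the local input log file; last argument: the output path on the Hadoop filesystem.
  (let [localLogTap (new Lfs (new TextLine) (first args))
        remoteLogTap (new Hfs (new TextLine) (last args))
        parsedLogFlow (. (new FlowConnector properties) connect localLogTap remoteLogTap importPipe)]
    ;; start launches the flow; complete blocks until it has finished.
    ;; (The let body already runs its forms in order, so no dorun wrapper is needed.)
    (. parsedLogFlow start)
    (. parsedLogFlow complete)))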
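
To see what the regex pulls out of a log line without firing up Hadoop, here is a rough REPL sketch. The regex string and field names are the ones from the listing above; the sample line is made up for illustration (it is not taken from apache.200.txt), and parse-line and the other dashed names are hypothetical helpers, not part of the port.

```clojure
;; Same regex string and field names as in the listing above.
(def apache-regex
  "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$")
(def field-names ["ip" "time" "method" "event" "status" "size"])

;; Made-up line in Apache common log format, for illustration only.
(def sample-line
  "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 2326")

(defn parse-line
  "Returns a map of field name to captured text, or nil if the line doesn't match."
  [line]
  (when-let [[_ & groups] (re-matches (re-pattern apache-regex) line)]
    (zipmap field-names groups)))

;; Evaluate at the REPL to inspect the six captured fields.
(parse-line sample-line)
```

This only mimics the group-to-field mapping in plain Clojure; inside the flow, RegexParser does the equivalent work on each tuple's "line" field.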

Items to note:

1. Having to use into-array, int-array, etc. Just a few Java-isms.

2. Roughly, Hadoop needs a Jar with a Main. Clojure uses :gen-class to handle that sorta thing. :gen-class at first seemed really complicated. Turns out it is and it isn’t. Keep it simple and no problem! For my purposes I just needed to make sure I had a -main definition.

3. You’ll see logparser.app everywhere. It’s the tie that binds this all together. Most of this exercise was really about getting a common namespace set up for everything: the Clojure, the build.xml, the Jar contents, the runtime environment.
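
On the first item: Java methods that expect arrays or varargs won't accept a Clojure vector, so you build the array yourself. A small sketch you can paste into a REPL, no Cascading needed (the var names are just for illustration):

```clojure
;; into-array builds a typed Java array; here the element type is given explicitly.
(def names (into-array String ["ip" "time" "method"]))

;; int-array builds a primitive int[].
(def groups (int-array [1 2 3]))

(.getSimpleName (class names))   ; "String[]"
(.getSimpleName (class groups))  ; "int[]"
(vec groups)                     ; back to a Clojure vector: [1 2 3]
```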

From the build.xml the key tasks are the compile and the jar-with-manifest. They demonstrate what needs to happen to make Clojure compilation possible and to make a Hadoop-happy Jar.
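
The Clojure half of that can be sketched roughly as follows. This is not the contents of the build.xml in the zip, just a minimal illustration of what ahead-of-time compilation needs from the source side; example.app and parse-args are hypothetical names standing in for the real ones.

```clojure
(ns example.app   ; hypothetical namespace, standing in for logparser.app
  (:gen-class))   ; when AOT-compiled, emits a Java class named example.app

(defn parse-args
  "Hypothetical helper: first arg is the input path, last is the output path."
  [args]
  {:input (first args) :output (last args)})

;; The generated class exposes -main as public static void main(String[]),
;; which is the entry point a Main-Class manifest entry can name.
(defn -main [& args]
  (println (parse-args args)))

;; AOT compilation is triggered with (compile 'example.app), provided the source
;; is on the classpath and the directory named by *compile-path* exists and is
;; on the classpath as well.
```

Outside of AOT compilation the :gen-class clause is a no-op, so the namespace still loads and runs normally at a REPL.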

If you’re following along with Cascading’s Gentle Introduction you can use your newly generated Jar in place of the one mentioned:

hadoop jar ./build/logparser-0.1.jar data/apache.200.txt output

I’ve only run this in the local Hadoop mode, no distribution, no cluster. Running it on a cluster will be for another post perhaps.

I hope this helps. Let me know otherwise. I’ll try to help but I’m far from an expert on any of this.
