Cascading’s Logparser example in Clojure
July 2nd, 2009
I’ve long been interested in working with Cascading but didn’t relish the thought of jumping back into Java. Thankfully with the arrival of Clojure I can now happily play in my less-typing world.
I wanted to get a basic Cascading demo running, so I took the Logparser example from the distribution and ported it to Clojure. Here’s a zip (casclojure.zip) with the Ant build.xml, the Clojure source, the directory layout, etc. I’ll leave it up to you to get the required jars and libs; I’m using Hadoop 0.18.3 and Cascading 1.0.11. The build.xml has the usual config for setting where you keep your jars. I gave up trying to be smart about managing jars, so I now dump everything into a single dir. It sure makes things like Ant config easier.
Incidentally, this is a perfect example of standing on the shoulders of genius and/or the sufficiently motivated. Most of the build.xml I picked up from Kyle Burton’s “Creating Executable Jars For Your Clojure Application”. (He also has some other interesting Clojure posts on Incanter and the Quartz scheduler.)
Clojure is concise. This example isn’t particularly idiomatic Clojure, however; you wouldn’t use CamelCase, for instance. But it does the job. (Also, gross, definitely need a syntax highlighter now.)
(ns logparser.app
  (:gen-class)
  (:import
    (java.util Properties)
    (cascading.flow Flow FlowConnector)
    (cascading.operation.regex RegexParser)
    (cascading.pipe Each Pipe)
    (cascading.scheme TextLine)
    (cascading.tap Hfs Lfs Tap)
    (cascading.tuple Fields)
    (org.apache.log4j Logger)))

;; Output field names, one per capture group in apacheRegex.
(def apacheFields (new Fields (into-array ["ip" "time" "method" "event" "status" "size"])))

;; Matches one line of an Apache access log.
(def apacheRegex
  "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$")

;; Regex groups to emit, in the same order as apacheFields.
(def allGroups (int-array [1 2 3 4 5 6]))

(def parser (new RegexParser apacheFields apacheRegex allGroups))

;; Apply the parser to the "line" field of every incoming tuple.
(def importPipe (new Each "parser" (new Fields (into-array ["line"])) parser))

(def properties (new Properties))
(FlowConnector/setApplicationJarClass properties logparser.app)

(defn -main [& args]
  (let [localLogTap (new Lfs (new TextLine) (first args))
        remoteLogTap (new Hfs (new TextLine) (last args))
        parsedLogFlow (. (new FlowConnector properties) connect localLogTap remoteLogTap importPipe)]
    ;; The let body already runs its forms in order; dorun is meant for
    ;; forcing lazy seqs, so it is not needed around these two calls.
    (. parsedLogFlow start)
    (. parsedLogFlow complete)))
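To see what the regex pulls out of a line, here is a small sketch that runs the same pattern string through Clojure’s re-matches, outside of Cascading and Hadoop. The sample log line and the names apache-pattern, field-names, sample-line and parse-line are made up for illustration; inside the flow, RegexParser does the matching.

```clojure
;; Same pattern string as apacheRegex, compiled with re-pattern.
(def apache-pattern
  (re-pattern
    "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$"))

;; Same names, in the same order, as apacheFields.
(def field-names ["ip" "time" "method" "event" "status" "size"])

;; Hypothetical log line, not taken from data/apache.200.txt.
(def sample-line
  "127.0.0.1 - - [02/Jul/2009:10:00:00 -0700] \"GET /index.html HTTP/1.1\" 200 1234")

(defn parse-line
  "Returns a map of field name to captured group, or nil if the line does not match."
  [line]
  (when-let [[_ & groups] (re-matches apache-pattern line)]
    (zipmap field-names groups)))

(parse-line sample-line)
```

For the sample line, the "ip" entry is "127.0.0.1" and the "event" entry is "/index.html". A line that doesn’t look like an access log entry yields nil.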
Items to note:
1. Having to use into-array, int-array, etc. Just a few Java-isms.
2. Roughly, Hadoop needs a Jar with a Main. Clojure uses :gen-class to handle that sorta thing. :gen-class at first seemed really complicated. Turns out it is and it isn’t. Keep it simple and no problem! For my purposes I just needed to make sure I had a -main definition.
3. You’ll see logparser.app everywhere. It’s the tie that binds this all together. Most of this exercise was really about getting a common namespace set up for everything: the Clojure, the build.xml, the Jar contents, the runtime environment.
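On the first item, a quick REPL-style sketch of what into-array and int-array produce may help. It uses only core Clojure, so it runs without Cascading on the classpath; the var names are hypothetical.

```clojure
;; into-array picks the element type from the first item,
;; so a vector of strings becomes a String[].
(def field-names (into-array ["ip" "time" "method"]))

;; The element type can also be given explicitly.
(def line-field (into-array String ["line"]))

;; int-array builds a primitive int[], not an array of boxed numbers.
(def groups (int-array [1 2 3 4 5 6]))

(.getComponentType (class field-names)) ; java.lang.String
(.getComponentType (class groups))      ; int
(alength groups)                        ; 6
```

Passing the type explicitly, as in line-field, is the safer habit when the vector might be empty or hold mixed types.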
From the build.xml the key tasks are the compile and the jar-with-manifest. They demonstrate what needs to happen to make Clojure compilation possible and to make a Hadoop-happy Jar.
If you’re following along with Cascading’s Gentle Introduction you can use your newly generated Jar in place of the one mentioned:
hadoop jar ./build/logparser-0.1.jar data/apache.200.txt output
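As a rough sketch of how those two arguments reach the program: assuming the Jar’s manifest names the Main-Class, Hadoop hands everything after the Jar name to it, and -main receives the arguments as a seq. The namespace argdemo.app and the helper describe-args are hypothetical and only mirror the (first args) / (last args) usage in the listing.

```clojure
(ns argdemo.app
  (:gen-class))

(defn describe-args
  "Maps command-line args to the paths the Logparser port reads and writes."
  [args]
  {:local-log  (first args)   ; what the Lfs tap would read
   :remote-out (last args)})  ; what the Hfs tap would write

(defn -main [& args]
  (println (describe-args args)))

;; What -main would see for the command above.
(describe-args ["data/apache.200.txt" "output"])
```

Because the listing uses last rather than second, running with a single argument would use the same path for both taps, so it’s worth passing exactly two.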
I’ve only run this in the local Hadoop mode, no distribution, no cluster. Running it on a cluster will be for another post perhaps.
I hope this helps. Let me know otherwise. I’ll try to help but I’m far from an expert on any of this.