Cascading’s Logparser example in Clojure
July 2nd, 2009 • Uncategorized • No comments
I’ve long been interested in working with Cascading but didn’t relish the thought of jumping back into Java. Thankfully with the arrival of Clojure I can now happily play in my less-typing world.
I wanted to get a basic Cascading demo running so I took the Logparser example from the distribution and ported it over to Clojure. Here’s a zip (casclojure.zip) with the Ant build.xml, the Clojure source, the directory layout, etc. I’ll leave it up to you to get the required jars & libs–I’m using Hadoop 0.18.3 and Cascading 1.0.11. The build.xml has the usual config stuff for setting up where you keep your jars. I gave up trying to be smart about managing jars so I now dump everything into a single dir. Sure makes things like Ant config easier….
Incidentally, this is a perfect example of standing on the shoulders of genius and/or the sufficiently motivated. Most of the build.xml I picked up from Kyle Burton’s “Creating Executable Jars For Your Clojure Application”. (He has some other interesting Clojure posts talking about Incanter & Quartz scheduler btw.)
Clojure is concise. This example isn’t particularly idiomatic Clojure however; you wouldn’t use CamelCase for instance. But it does the job. (Also, gross, definitely need a syntax highlighter now.)
(ns logparser.app
  (:gen-class)
  (:import
    (java.util Properties)
    (cascading.flow Flow FlowConnector)
    (cascading.operation.regex RegexParser)
    (cascading.pipe Each Pipe)
    (cascading.scheme TextLine)
    (cascading.tap Hfs Lfs Tap)
    (cascading.tuple Fields)
    (org.apache.log4j Logger)))
(def apacheFields (new Fields (into-array ["ip" "time" "method" "event" "status" "size"])))
(def apacheRegex
"^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$")
(def allGroups (int-array [1 2 3 4 5 6]))
(def parser (new RegexParser apacheFields apacheRegex allGroups))
(def importPipe (new Each "parser" (new Fields (into-array ["line"])) parser))
(def properties (new Properties))
(FlowConnector/setApplicationJarClass properties logparser.app)
(defn -main [& args]
  (let [localLogTap (new Lfs (new TextLine) (first args))
        remoteLogTap (new Hfs (new TextLine) (last args))
        parsedLogFlow (. (new FlowConnector properties) connect localLogTap remoteLogTap importPipe)]
    ;; start the flow, then block until it completes
    (do
      (. parsedLogFlow start)
      (. parsedLogFlow complete))))
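One way to sanity-check the regex before handing it to Cascading is to try it at the REPL on a single line. This is only a minimal sketch: the log line is made up (in Apache's common log format), the names apache-regex, sample-line and parsed are mine, and the regex string is copied from the listing above. It runs in plain Clojure with no Cascading or Hadoop jars.

```clojure
;; Same pattern as apacheRegex in the listing above.
(def apache-regex
  "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$")

;; A made-up log line, for illustration only.
(def sample-line
  "127.0.0.1 - - [01/Jul/2009:12:00:00 -0700] \"GET /index.html HTTP/1.1\" 200 1234")

;; re-matches returns [whole-match group1 ... group6]; drop the whole match
;; to get the six values that line up with ip, time, method, event, status, size.
(def parsed
  (vec (rest (re-matches (re-pattern apache-regex) sample-line))))

(prn parsed)
;; prints ["127.0.0.1" "01/Jul/2009:12:00:00 -0700" "GET" "/index.html" "200" "1234"]
```

The six captured groups come back in the same order as the names in apacheFields, which is what the group numbers 1 through 6 in allGroups refer to.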
Items to note:
1. Having to use into-array, int-array, etc. Just a few Java-isms.
2. Roughly, Hadoop needs a Jar with a Main. Clojure uses :gen-class to handle that sorta thing. :gen-class at first seemed really complicated. Turns out it is and it isn’t. Keep it simple and no problem! For my purposes I just needed to make sure I had a -main definition.
3. You’ll see logparser.app everywhere. It’s the tie that binds this all together. Most of this exercise was really about getting a common namespace set up for everything: the Clojure, the build.xml, the Jar contents, the runtime environment.
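On the first item: the array helpers are there because a Java constructor that wants an array (or varargs, which is an array underneath) won't take a Clojure vector. Here's a small sketch of what into-array and int-array actually hand back. The names field-names and group-indexes are hypothetical, and I've passed String explicitly to into-array, which the listing above doesn't do (without it, the element type is inferred from the first item).

```clojure
;; Hypothetical names, for illustration only.
(def field-names (into-array String ["ip" "time" "method"]))
(def group-indexes (int-array [1 2 3]))

(println (.getName (class field-names)))   ; prints [Ljava.lang.String;
(println (.getName (class group-indexes))) ; prints [I
(println (alength field-names))            ; prints 3
(println (aget group-indexes 0))           ; prints 1
```

Those odd-looking class names are just how the JVM spells String[] and int[].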
From the build.xml the key tasks are the compile and the jar-with-manifest. They demonstrate what needs to happen to make Clojure compilation possible and to make a Hadoop-happy Jar.
If you’re following along with Cascading’s Gentle Introduction you can use your newly generated Jar in place of the one mentioned:
hadoop jar ./build/logparser-0.1.jar data/apache.200.txt output
I’ve only run this in the local Hadoop mode, no distribution, no cluster. Running it on a cluster will be for another post perhaps.
I hope this helps. Let me know otherwise. I’ll try to help but I’m far from an expert on any of this.
Another reason I like Clojure
June 25th, 2009 • Uncategorized • No comments
Exploring new packages of code, say the latest jar from Echo Nest, is incredibly easy and just plain fun. For instance:
(import '(com.echonest.api.v3.artist Artist ArtistAPI DocumentList))
(def artist-api (new ArtistAPI "YOUR DEV KEY HERE"))
(def hot-artists (. artist-api getTopHotttArtists 10))
(println (.. (first hot-artists) getItem getName))
Will spit out “Papercuts” (as of the time of this writing), the current #1 artist on Echo Nest’s Hottt list.
I started looking for a nice WordPress plugin that would do some pretty syntax highlighting on the Clojure code, but it was taking too long. I actually spent less time writing up the Clojure.
Big in Twitter offline
June 25th, 2009 • Uncategorized • No comments
UPDATE: Fixed!
Something broke during a recent OS upgrade. Hopefully will have it back up sometime today. Thanks.
Links June 19th
June 19th, 2009 • Uncategorized • Comments Off
- Adding Google Maps To Your Rails Applications - "In the months following publication of the final part of the very popular series on integrating Google Maps into PHP applications, I've spent quite a bit of time working with another popular Web technology: Ruby on Rails. As it turns out, Rails developers have been hard at work creating a few amazing plugins capable of adding powerful mapping capabilities to your applications. In this new series, I'll introduce you to these powerful plugins, showing you a number of tips and tricks along the way."
- Geocoding with the Rails GeoKit Plugin - "In the opening installment of this series, you learned how to integrate Google Maps into your Rails applications by using the powerful YM4R/GM plugin. YM4R/GM makes it trivial to add maps and map features such as icons and information windows, but requires you to first derive the latitudinal and longitudinal coordinates of the desired locations. But what if you don't know these coordinates? Enter the GeoKit plugin."
- HubLog: Yahoo! PlaceMaker - "As an entity extraction service - the actual service, ignoring the quality of the results - Yahoo! PlaceMaker is excellent. It takes plain text or XML as input, and returns a list of places identified within the text. For XML input, it even returns an XPath and offsets for each entity so you can easily splice the annotations back into the document. Here's somewhere you can test the extraction, though it's only marking the centre of extracted locations, not the whole region."
- Geokit for Ruby & Rails: home - "Rails + Geokit = easier mapping apps. Geokit provides geocoding, location finders, distance calculations, and more."
- Togaware: Rattle: Gnome Cross Platform GUI for Data Mining using R - "Rattle: Gnome Cross Platform GUI for Data Mining using R" might be the closest thing to a Weka-style collection of tools
- Book Recommendations « Statistical Programming with Clojure
A few examples of Digital meeting Real Life
June 12th, 2009 • Uncategorized • No comments
To hear the Sufjan Stevens track “The Lonely Man of Winter” you have to pack some headphones and travel to Brooklyn. There you will find one Alec Duffy, contest winner and owner of the exclusive rights to the song.
Duffy didn’t think uploading it and sharing it was special enough. To hear it you need to visit him (and his wife) and plug your headphones into his iPod. In person. Intimate. That’s special.
Much has been said about this. The WSJ article “Not-So-Easy Listening: It Takes a Trek to Hear This Track” has plenty of examples.
So far 60 people have visited Duffy. 60 people traveling to listen to roughly 4.5MB of bits.
You’ve trained all year, you’re ready. You show up for the regional CrossFit Games qualifier and your first competition involves handstand push-ups. You suck at those. You don’t place.
There is a Plan B! The Last Chance Online Qualifier!
At the provided start time, you have 24 hrs to record video of you doing the Online Qualifier workouts and upload to a public forum, say Vimeo or YouTube. Once uploaded you then have to email your times and video links to the judges. On Vimeo, you’ll find 91 videos, YouTube has about 100.
Nearly 200 people taking the time to record & upload video for a chance to compete in a very strenuous fitness competition.
Digital stuff, sharing with a few or sharing with many. Fantastic.
Links June 8th
June 8th, 2009 • Uncategorized • Comments Off
- Design Observer - "New books continue to pile up at Design Observer. Again, we want to share the thirty-plus recently published titles we've received over the past few months. We are happy to support the many design writers and publishers who create these books. No criticism or commentary here: but a sense of the times through the books we've received. Maybe you'll find a surprise or two…"
- Port Map: Simple port mapping for your router Review | Networking | Mac Gems | Macworld
- Jonathan Mendez's Blog: API Battle Plans: Fighting for Next - "We have reached maturation point with APIs where the three core components of the web experience – content, utility & data – are becoming readily available via API delivery. The implication of this growth is nothing less than the next web. A smarter web that delivers improved relevance, a better experience and expanded revenue generation opportunities. As the ramifications of these benefits become understood businesses now have no choice but to support an API superstructure the pillars of which are API content, utility & development and analytics."
- A Taste of 2.8: The Interactive Interpreter (REPL) | The Scala Programming Language - "The Scala Interpreter (often called a REPL for Read-Evaluate-Print Loop) sits in an unusual design space - an interactive interpreter for a statically typed language straddles two worlds which historically have been distinct. In version 2.8 the REPL further exploits the unique possibilities. [...] On startup, the REPL analyzes your classpath and creates a database of every visible package and path. This is available via tab-completion analagous to the path-completion available in most shells. If you type a partial path, tab will complete as far as it can and show you your options if there is more than one." one thing i sorta miss with clojure, the free stuff you get with types. don't miss it enough however.
Links June 4th
June 4th, 2009 • Uncategorized • Comments Off
- Visualizing AMQP Broker Behavior with Clojure and Incanter - "I started the process of mocking up and automating the tests with the goal to help us figure out what we want to test. I chose to do this with Clojure and to use Incanter for a quick and easy visualization."
- Art And Design Short Courses - "Art & Design Short Courses is an independent short course school delivering quality teaching and intensive learning experiences for the individual. Because we regard each student as an individual, our courses provide an important opportunity for individuals to develop their personal and professional skills."
- HotPads Daily :: The Official Blog of HotPads.com: HotPads on AWS - "HotPads abandoned our managed hosting in December and took the leap over to EC2 and its siblings. The presentation has a lot of detail on costs and other things to watch out for, so if you're currently planning your "cloud" architecture, you'll find some of this really helpful." video & slides
- 6.4. Introduction to Time Series Analysis - good overview, nice example of exponential smoothing methods
- Trazzler Buzz - Trazzler - "The Trazzler Buzz list is created from the volumes of information being transmitted to Twitter every second about 10,000 spots in 50 cities, plus festivals and outdoor destinations all over the world. We rank the list according to a formula that measures volume and recent activity on Twitter. The buzz list is the ultimate source of research on where people are going (or want to go) right now. Trazzler.com goes beyond the buzz and transports you to these places—and makes savvy recommendations based on your preferences—so you can find out where you should go on your next trip"
Links May 24th
May 24th, 2009 • Uncategorized • Comments Off
- La Clojure - JetBrains IntelliJ IDEA Plugin Repository - vim is working out but it is nice to know this is there as well
- Secret of Googlenomics: Data-Fueled Recipe Brews Profitability - "I'm going to talk about online auctions," says Hal Varian, the session's first speaker. Varian is a lanky 62-year-old professor at UC Berkeley's Haas School of Business and School of Information, but these days he's best known as Google's chief economist. This morning's crowd hasn't come for predictions about the credit market; they want to hear about Google's secret sauce.
- 10 Lessons in Treemap Design: Juice Analytics
- TwitterAlikeExample - redis - Google Code - "A case study: Design and implementation of a simple Twitter clone using only the Redis key-value store as database and PHP" tells a good story about building something on top of key value only
- Yahoo! Placemaker™ Beta - YDN - "Yahoo! Placemaker is a freely available geoparsing Web service. It helps developers make their applications location-aware by identifying places in unstructured and atomic content – feeds, web pages, news, status updates – and returning geographic metadata for geographic indexing and markup."
- Finding Friends With MapReduce - Steve Krenzel - one of the better mapreduce explanations i've seen
The Colors Of Verner Panton from Colourlovers
May 18th, 2009 • Uncategorized • No comments

“Verner Panton is considered one of Denmark’s most influential 20th-century furniture and interior designers. During his career, he created innovative and futuristic designs in a variety of materials, especially plastics, and in vibrant colors.” Check out the entire post; loaded with fantastic images.
Links May 18th
May 18th, 2009 • Uncategorized • Comments Off
- Developers Home - MetaCarta Web Services
- Text Mining Application Programming (Programming Series): Manu Konchady
- pyTivo - "pyTivo is both an HMO and GoBack server. Similar to TiVo Desktop pyTivo loads many standard video compression codecs and outputs mpeg2 video to the TiVo. However, pyTivo is able to load MANY more file types than TiVo Desktop."
- Common Java Cookbook - many chapters totally unnecessary with clojure incidentally
- Mathematica Cookbook | O'Reilly Media - remind me to buy this in october.
- Paul Dix Explains Nothing: Breath fire over HTTP in Ruby with Typhoeus - "Typhoeus is a fearsome Ruby library that enables parallel HTTP requests while cleanly encapsulating handling logic. Specifically, it uses libcurl and libcurl-multi to run HTTP really fast. Further, it's designed with the focus of creating client libraries that work with web services. These could be external services like Twitter or systems like CouchDB and SimpleDB or custom web services that you write yourself." conflicted: big fan of libcurl, not such a big fan of ruby
- Scripting the Vim editor, Part 1: Variables, values, and expressions - "Vimscript is a mechanism for reshaping and extending the Vim editor. Scripting allows you to create new tools, simplify common tasks, and even redesign and replace existing editor features. This article (the first in a series) introduces the fundamental components of the Vimscript programming language: values, variables, expressions, statements, functions, and commands. These features are demonstrated and explained through a series of simple examples."