Haskell Data Analysis and Machine Learning Cookbook

IanCal · on July 29, 2016

I like seeing more things come out helping people with data analysis in various languages, but this put me off:

> isWhitespace x = elem x " \t\r\n"

This is the kind of thing that makes me concerned about using this resource for real-world data. In real-world data you're going to get all kinds of crazy things coming in, and if you're assuming nobody will ever have something like a zero width non-breaking space, or a form feed, you're going to have a problem.

It's the kind of thing I see with people starting out when dealing with data, similarly the punctuation detection here: https://github.com/BinRoot/Haskell-Data-Analysis-Cookbook/bl...

If you rely on these things, you will have problems. Text is hard and weird and terribly more complicated than people usually expect.

Does haskell have good libraries for dealing with the more awkward parts? Can I easily remove all characters marked as whitespace in unicode, for example? Detecting and managing mangled encodings?

zallarak · on July 29, 2016

That line of code was demonstrative. The actual book uses 'Data.Char.IsSpace' which properly handled it.

"Returns True for any Unicode space character, and the control characters \t, \n, \r, \f, \v."

Before you chastise them for not handling something, verify it. I'm not affiliated with the book but you probably deterred people from buying it.

https://github.com/BinRoot/Haskell-Data-Analysis-Cookbook/bl...

IanCal · on July 30, 2016

Well for one the line shouldn't really be one of the first bits of advertising for the book if the author also knows it's wrong. The second example is taken from the GitHub repo for the book though, and is exactly the same type of error.

> you probably deterred people from buying it.

Quite possibly, but I think with good reason. I don't know what's in the book but I'm concerned it won't contain things like a discussion of what whitespace is and is not, how to decide what you should do for your data and when isSpace might not do what you really need. I can't review it properly but at least one bit of code in the repo looks dangerous and one bit on the website looks dangerous.

kornish · on July 29, 2016

I agree that line of code was a turn-off because of its oversimplification.

> That line of code was demonstrative.

Why demonstrate with something completely unrepresentative of the actual contents, then?

ocharles · on July 29, 2016

For more realistic stuff, I generally use https://hackage.haskell.org/package/text-icu. For example, in an old project we had a type called `Texty`, which was roughly non-empty normalized text:

  mkTexty :: Text -> Maybe Texty
  mkTexty = \a ->
      let a' = T.dropAround Char.isSpace (ICU.normalize ICU.NFC a)
      in if T.null a' then Nothing
                      else Just (Texty a')
  {-# INLINE mkTexty #-}

IanCal · on July 29, 2016

Great, thanks for the link.

_pvxk · on July 29, 2016

Yeah, that doesn't give high hopes for this book teaching best practices :/

If you don't want to depend on text-icu, there is support for more Unicode space characters in https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-C...

    import           Data.Char             (isAlpha, isLower, isSpace)
    import qualified Data.Text             as T

giving you e.g.

    textIsAllSpace t = T.all isSpace t

gh02t · on July 29, 2016

Important note for non-Haskellers: the library OP is talking about is part of the standard base libraries usually distributed with GHC. `text-icu`, while excellent, is not.

edgordon · on July 29, 2016

This is a repost of a book we published in 2014. It's well reviewed, and the author kept the Github repo for the code up to date with feedback (https://github.com/BinRoot/Haskell-Data-Analysis-Cookbook). If anyone wants to try it, you can pick it up in the Packt sale currently for $10 @ packtpub.com

sdx23 · on July 29, 2016

I have bought that book some years ago and wouldn't do so again. It was very disappointing to see neither a focus on Haskell nor on data analysis. It scratches both topics but covers only very elementary things. The content is mostly short receipts that were to me of no value at the time.

For people interested in the topics I'd recommend to buy some other good books, on Haskell and data analysis separately.

If, however, receipts are to your liking and you're only starting out with Haskell / data science maybe this is something for you (or maybe not).

blubb-fish · on July 29, 2016

Serious question:

Is there any reason why somebody would use Haskell for data analysis when there is also R and Python - which are perfect for that job - except for that the respective person happens to be a Haskell expert anyway?

gh02t · on July 29, 2016

I use all three, depending on the task. Haskell is compiled and relatively speedy, as well as being great for writing custom parsers (plus in a lot of situations it parallelizes very easily). One thing I find Haskell to be really useful for is coercing large unstructured datasets into a format that is easier to feed into Python.

For instance, I once had to write a parser for the data coming off of a digitizer being fed by a rather complicated radiation detector array. It's in an obscure and somewhat bizarre binary format that is pretty tedious to work with because it involves a lot of state in the parser, plus the files are pretty enormous (they describe every radioactive particle hitting every detector in the array over several hours of measurement). My colleague wrote a horribly tedious script in Python to parse it that was complicated and agonizingly slow, but I was able to write a very natural 80-line Haskell program in a few hours that was several orders of magnitude faster as well as much more robust. I was just massaging the data to feed into Python, but it was far far easier to do in Haskell.

So I find for some tasks I want to reach for Haskell, because it's natural to express the solution in Haskell. Other stuff I wouldn't even consider it, especially for stuff like data processing and plotting Haskell is not as elegant. Right tool for the job and all.

jb1991 · on July 29, 2016

This 80-line solution sounds very interesting. I'd love to see more examples of this particular strength of Haskell. Are there similar parsing projects that you can point to that are open source and worth looking at to better understand this use case of Haskell?

gh02t · on July 29, 2016

I remembered a bit wrong, the final version was 155 lines (I think my first stab was ~80 lines).

I have a version of it on Github as a gist https://gist.github.com/jasonmhite/c4c56d4c50fc673e658b71b82... . Can't really claim whether or not it's "good" Haskell or that it's optimal, but it gets the job done. I also dunno how comprehensible it is without knowing the ins-and-outs of the nPOD format or multiplicity counting, but hey there it is.

Not sure of what the definitive reference is, but this article is pretty good I think:

https://www.schoolofhaskell.com/school/starting-with-haskell...

PinkyThePiggy · on July 29, 2016

Here is an example from Real World Haskell (pretty good intermediate book, although it is a bit dated): http://book.realworldhaskell.org/read/using-parsec.html#csv

A fully featured CSV parser in 20-30 [depending on how you count] lines of non-golf code.

jb1991 · on July 29, 2016

I'm curious what aspects of Haskell have changed to cause this book to be dated, i.e. are there examples in this book which would be done differently now?

luisfoliv · on July 29, 2016

This SO thread is very informative in this regard: http://stackoverflow.com/questions/23727768/which-parts-of-r...

PinkyThePiggy · on July 29, 2016

As one example, a lot of the libraries referenced in the book have gone through major version changes (1.x to 2.x etc.) that change the functions being exported and/or architecture reworks changing which libraries you need to import which means that if you try to copy + compile a lot of the code snippets, they will not compile. It is generally trivial changes in most cases, but if you are a newbie to the language, it is likely not immediately obvious because you will think you copied from the book wrong (or that the book has a typo) instead of the libraries having been changed.

There is also a lot written about historical quirks you will come across that have since been fixed.

e19293001 · on July 29, 2016

Thanks for sharing your experience with Haskell. Could you please point us out some reference where you've learned in creating a parser using Haskell? Any of the blogs, tutorials, articles, books would be great.

albertus · on July 29, 2016

Not in the context of data analysis, but the Write Yourself a Scheme in 48 Hours[0] wikibook is a good exercise in working with Parsec. It's written for Haskell beginners, but it wasn't tedious in that regard.

[0] https://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_...

wrsh07 · on July 29, 2016

Did you see this post about a partial C to Rust transpiler written in literal Haskell? https://github.com/jameysharp/corrode/blob/master/src/Langua...

p4wnc6 · on July 29, 2016

Haskell as a language is excellent for these tasks and when you are familiar with it, you can be even more rapidly productive than with Python, and much, much more than with R.

The problem for me has not so much been any technical aspect of Haskell nor any technical benefits of Python or R. Instead, it has been sociological issues with the Haskell community.

I won't go into it too much, but the biggest one is that there is no cohesive way to understand what kind of progression of Haskell knowledge makes you "a beginner" or "intermediate" or "advanced" -- especially for the purposes of getting a job. You might have mastered all the basics from LYAH, and know monads inside and out, and then someone comes along and says you're totally a Haskell noob because you haven't used Template Haskell for 6 years, or you don't do everything with Monad transformers, or you don't have the API of some tool like Parsec committed to memory, or you don't use language extensions and LiquidHaskell to use the compiler as a proof system of the correctness of your code. In Haskell, you're always made to feel like you're constantly a fuck-up for not knowing the next great wrinkle of abstraction or the next great toolkit up the food chain.

No joke, I've experienced being called a Haskell "beginner" because I was not deeply familiar with LiquidHaskell. That's not a common opinion, but it gives you a sense of the variety. This stresses me out so much that I don't even bother applying to Haskell jobs anymore. I don't want to get an initial phone screen and then just be made to feel like I'm a dunce because I don't know how some category theory principle is embodied by phantom types or something. Yuck.

Getting Haskell jobs that will actually pay you according to your experience and ability to learn is very hard. Most places just don't want to hire people who aren't super experienced in Haskell, and they default to believing everyone is a beginner unless they wrote a math Ph.D. thesis on multiparameter type classes.

On the off chance that someone will talk to you, and they think "OK, this person has a bunch of years of data analysis in Python or R under their belt, and they know enough about Haskell to work pretty quickly with Monads and basic type classes ..." that means you are a junior engineer at best, and will be paid like it, even if your data analysis skill is very high or you're a very fast learner or you have an advanced degree or many years of experience.

I've never been able to figure out the impenetrable bubble around Haskell jobs, but this kind of culture of believing that about 99% of Haskellers are beginners and less than 1% are anything beyond a beginner is the biggest reason why I mostly gave up learning Haskell or searching for Haskell jobs.

It's funny because a lot of people argue functional programming is too much about elegance to be pragmatic, and they can "just get stuff done" with other tools. Then lots of functional programmer enthusiasts jump in and refute that (I believe correctly) but then turn around and only hire people if they jump through all of the unpragmatic too-much-elegance hoops. It's really vexing.

zelos · on July 29, 2016

It definitely seems like there's a lack of intermediate-to-advanced Haskell books. This looks like it contains a lot of canonical Haskell coding examples and might be useful: any Haskell experts that can weigh in?

The other Packt Haskell book is apparently terrible, so I'm a bit cautious.

15155 · on July 29, 2016

This is my favorite resource: http://haskellbook.com/

It covers beginner topics through arguably advanced topics (monad transformers, etc.)

"Advanced" is subjective, but I don't believe the current version covers GADTs, type families, or other more esoteric extensions.

(No affiliation to the authors)

cm3 · on July 29, 2016

I've seen GADTs as a feature that Haskell devs complained about, similar to Template Haskell. I know the issues with TH, but what's the limitation of GADTs as implemented in Haskell, and are there languages with less problematic implementations?

tel · on July 29, 2016

Maybe some people think they're a little complex, but there's nothing particularly wrong with them in Haskell. They can be extended further in a dependently-typed context, but that's really another thing completely.

cies · on July 29, 2016

> I don't believe the current version covers GADTs, type families

Indeed it does not currently cover that.

mark_l_watson · on July 29, 2016

I bought this book a couple of years ago and never read through it. It does provide convenient recipes that you can look up in the table of contents. I like recipe books like this, but be warned that there is not much depth. I recently bought another book by the same author that is also useful.

dschiptsov · on July 29, 2016

BTW, one might notice how a language with type-tagged data (a value has a type, not a variable, there are no box-like variables, but bindings) is much more suitable for data exploration and analysis (python + pandas is a good example).

Also homogeneous lists and especially conditionals is kind of awkward - tuples aren't as flexible as lists. Defining a type for each possible row and then pattern-match on will result in a lot of useless boilerplate, almost as bad as Java.

Common Lisp, it seems, is a better choice for such problems.

reuben364 · on July 29, 2016

The dependently-typed-ish features of haskell can provide a sweet spot of both untagged data along with the flexibility of heterogeneous lists.

nine_k · on July 29, 2016

An example / reference would be nice.

eggy · on July 29, 2016

I might have a look, but the page seems to be a big ad with animated buy buttons at top?

I posted a link for a free online kdb training class 3 months ago for students that normally goes for $1300 with no affiliation to the company, and it was flagged. What's the difference?