More

arcfide · on March 6, 2017

Currently Dyalog APL provides green-threaded concurrent programming using a traditional "fork" model as well as a CSP-style distributed computing model called "Isolates." In addition, Dyalog will utilize vectorized operations on the CPU for a number of primitives and code patterns. Co-dfns provides for additional functionality by providing vector level parallelism on the GPU/CPU and a true threading "fork" (this feature is only part of the language extension, and has not yet been implemented). So, if you write your code in a Co-dfns compatible manner (closed single Namespace of dfns, and some limitations in the current version which are being lifted rapidly), then you will be able to execute it on the GPU, CPU, and SMP machines.

As a trivial example, say you wanted to compute the quadratic prime numbers algorithm in parallel:

    Primes←{(2=+⌿0=X∘.|X)/X←⍳⍵}

You would simply compile this function inside of a namespace using Co-dfns and you could then run this function on the GPU to improve performance. The compiled code would run on either the CPU, GPU, or SMP machines.

throwaway7645 · on March 7, 2017

It just does it automatically?

arcfide · on March 8, 2017

arcfide · on March 6, 2017

Here's an English reading of the above code:

Filter the numbers from [2,R), removing any number that is a member of the set of products of pairs of elements in [2,R).

arcfide · on March 6, 2017

See Andy Keep's dissertation on using Nanopass to reimplement the Chez Scheme compiler. The short answer is, not really. Nanopass can add some constant factors, but is essentially a linear/constant addition, whereas the majority of compilers are quadratic or greater in complexity. Chez Scheme was/is N log N in most cases, and even then Nanopass did not significantly harm performance.

It actually has benefits on the GPU beyond just the CPU, because the passes tend to be smaller, and thus more readily executed efficiently on the GPU, meaning that you would likely see a large net gain in performance by using Nanopass. Since I believe this might be the only compiler of its kind, it's hard to say for sure on that mark, though.

hzhou321 · on March 6, 2017

Does N here refer to the length of source code or the number of passes? When the source code is big, even the identity pass takes time, right? I don't understand how it become constant addition.

Could you quickly comment on that the majority of compilers being quadratic or greater?

By the benefits on the GPU, it refers to the size of the code for the passes or the size of the source code? I assume the passes are run serially, right?

[I'll read Andy Keep's paper.]

arcfide · on March 6, 2017

In compilers the N value is commonly considered to be the number of nodes in an AST. Compiler performance is generally evaluated based on its performance over the code, which is generally bound by the number of nodes in the AST.

The identity pass is linear in the size of the AST in almost all functional compiler designs. Optimizations would then make this constant time for most functional languages with adequate optimizations. The identity pass in Co-dfns would be the single character ⊢. The Co-dfns compiler would produce code for this pass which is constant time.

Likewise, with a bit of work, this would also be true for a Nanopass compiler function, unless you insisted on traversing the AST and rebuilding it yourself, in which case the framework could not optimize this away.

The majority of compilers use algorithms that are quadratic in the size of the AST (number of nodes) for a number of passes. Many compilers use a variety of graph coloring register allocation, which is NP-complete. However, better register allocation algorithms exist that are quadratic or quadratic log in the number of nodes in the AST. The Nanopass research has demonstrated that for real life compilers, the cost of certain compiler passes generally overwhelms most of the cost of the rest of the compiler, and so adding another linear factor by splitting a pass into multiple passes, generally does not affect the big-O complexity of the entire compiler, and thus, the overall performance at scale, though it can affect the constant factors by a small amount, depending on how good your allocation engine is for your host language.

This Nanopass research has only been done on compilers implemented on the CPU. My research is exploring whole compilers on the GPU, which is an area that shows promise, since even memory bound computations such as most compiler core compiler transformations, can see significant performance speedups, even if they move from a linear to log linear complexity.

The passes are not run serially. Each pass is data parallel, and while there is a serial dependency in that each pass is dependent on the results of the previous pass, nothing in the way the compiler is written forces a hard barrier (such as a join in fork-join programming) at the compiler pass boundaries. It is conceivable that optimizations techniques for stencils such as Trapezoidal decomposition could be applied to compilers written in this fashion to allow intra-pass optimization to occur, but this is future work.

The passes themselves execute in parallel on the GPU, or at least, that is their design. They are expressed as natively data parallel, or vectorized, algorithms. If run on the CPU through the interpreter, then they will execute using the Intel Vectorized code units using the optimized primitives in the Dyalog APL interpreter.

hzhou321 · on March 6, 2017

Thanks. Your work is inspiring.

arcfide · on March 6, 2017

When I spoke about benefits on the GPU, I was speaking of the benefits of executing the compiler passes on the GPU, because with a Nanopass architecture, or really, the Micropass architecture I'm using here, each pass is simpler than the equivalent set of passes that do the same thing in a traditional compiler. This means that the passes are generally each simpler, and thus, easier to parallelize.

arcfide · on March 6, 2017

Thanks for giving me a chance to chat a bit more!

FullyFunctional · on March 9, 2017

Thanks for sharing all these insights, it's definitely an eye opener, both regarding APL as well as conventional wisdom.

arcfide · on March 6, 2017

Thanks everyone for the great discussions and interest. I'm going to get back to this thread later, but won't be actively commenting for the rest of the day. Please feel free to continue discussion and continue asking questions and I'll get to them when/if I can.

Also, feel free to contact me directly to discuss anything at a higher bandwidth level. And as always, see the GitHub repository for more information on the project and ways that you can contribute (help out, fund the project, contribute code, bug reports, or patches).

https://github.com/arcfide/Co-dfns

If I can "shill" a little bit, you can fund the Open Source/Open Work aspects of this project through micro-funding at Gratipay, which helps to support and keep the Open Source and community aspects of the project going. Dyalog is a great company, but ultimately, as some have mentioned, it's the community that helps to keep these ideas moving forward and improving.

javajosh · on March 6, 2017

Does working on your project, or running it, require a Dyalog license?

arcfide · on March 6, 2017

You'll need to get at least a free copy of Dyalog to use the compiler. They have a number of different free options. If you intend to use the compiler, it is available for use by all under an AGPL license. However, the generated code also uses Co-dfns code and is likewise under an AGPL license. If that license is unsuitable for you, then you'll need to arrange for an alternative license through Dyalog. Normally this will only be required for people interested in using Co-dfns commercially in their closed source projects.

The Co-dfns work is run like an Open Company however. You are welcome to submit contributions (see the README for more information) as pull requests and can onboard yourself at your leisure. I'm working to encourage more people to support the funding of Open Work on Co-dfns, but that's a long, slow process.

So, yes, generally speaking the compiler is designed to be used in conjunction with the host of additional tool support you get from Dyalog APL (the interpreter). However, Dyalog makes it as easy as possible to get a copy of the interpreter.

javajosh · on March 6, 2017

If I understand correctly, you can use the free, non-commercial-use Dyalog to run the AGPL Co-dfns compiler to produce programs. And also, the Co-dfns output is not independent of the compiler itself, meaning that the output must also be AGPL.

I'm speculating, but I suspect that anyone using Co-dfns output will be running it on a private bank of GPUs over a large private data-set, so I suppose the AGPL won't ever matter for them, in practice.

arcfide · on March 6, 2017

If that works for them, more power to them! They'll still likely need a commercial Co-dfns license if they are doing that, not because of the produced code, but because of the programming environment in which they do the development, unless they want to slice Co-dfns out of its "host" environment.

You are right, however, that you can use the free, non-commercial Dyalog to run the AGPL Co-dfns. Co-dfns is both a commercial and a research enterprise, and so it is important to me that the ideas in the compiler are manifest openly to the public, hence the open-source nature of the project. However, it also needs to fund that research somehow (have you ever tried to get a grant to write APL? Hahahaha!), and so commercial uses need to be managed in a reasonable way, namely, something mutually equitable.

Some interested parties are working to use Co-dfns in public facing services, and not just private data sets. One group is very slow moving, but we're exploring the possibility of a rewrite of the cryptographic stack.

mkesper · on March 6, 2017

The AGPL is not a "non-commercial" license. Freedom zero guarantees you can use it for any purpose. AGPL only becomes relevant when distributing.

arcfide · on March 6, 2017

Try this one:

(2=+⌿0=X∘.|X)/X←⍳20

:-) They're equivalent, but the second should give you a bit more intuition based on your understanding of the definition of prime numbers.

Use TryAPL.org to explore the code. Evaluation is right to left with parentheses being their standard meaning. Try seeing what the left and the right sides of the / expression give you, and then try removing pieces from the left of the left side expression of / incrementally to see how its built up.

If you need help seeing what the functions do, you can see help.dyalog.com "Language Reference" That will show you what the symbols do.

arcfide · on March 6, 2017

You may also want to see "Game of Life in APL" on YouTube.

https://www.youtube.com/watch?v=a9xAKttWgP4

arcfide · on March 6, 2017

I would consider ArrayFire a "hard abstraction" that is basically the bottoming out of the system. Previous to this I was using OpenACC. See my talk at the Dyalog User Meeting Last year to see some of the struggles there with Scan.

In short, ArrayFire is handling a lot of the core GPU algorithms for me, and that's its main contribution. This prevents me from needing to implement these myself, because this is itself a full time job. Much better to put a hard cutoff point and work on the high-level questions, rather than spending time trying to support three different architectures by myself. There is basically no good way of working at that level that isn't ugly and very very painful. I am glad to have someone else do that for me.

The benefit is that I can take better advantage of their platform independent and performance tweaks for individual GPUs (which is still a lot of work with the state of GPU programming today). This leaves me to focus on higher level concerns that are more relevant to the APL programmer.

arcfide · on March 6, 2017

1. http://www.jsoftware.com/papers/fork.htm See Hook and Fork. This was the first introduction of Trains programming in APL, I think. You can see help.dyalog.com for more information of the current implementation in Dyalog.

2. I think the tree printout of trains is a built in library feature of Dyalog APL. The dfns library (dfns.dyalog.com) has a tree printing dfns that may be of use in implementing it, and I think you can read the actual source code for that in the user command for display or the like.

I may implement a version of this for my ASTs to augment the Xml and px utilities, as it is a nice format and one that might be of use for printing out the AST of the compiler. I have not had need of it yet, though, so I'll probably get to it the next time I have to do a fancy demo for which I can prepare more adequately.

breck · on March 6, 2017

Thanks! Also found this link which had more info on Trains: http://help.dyalog.com/14.1/Content/Language/Introduction/Tr...

arcfide · on March 6, 2017

Ah, there it is, thanks!

arcfide · on March 6, 2017

Thanks for the questions!

1. I've addressed this a bit more in the previous threads on this video and the discussion that led up to the video. It's actually less write once than other stuff. However, to go with that, it's also more disposable. That means that basically, I have a low tolerance for code not continuing to make sense to me. So if I come back to a piece of code and it doesn't make sense to me, I just throw it out, because it was clearly wrong if I can't make sense of it.

I find it's quite natural to read through the code at this point. Even if I haven't thought about a particular compiler pass for a while, I just start at the beginning, and in a few seconds it's easy to see the structure and overall approach, then I go deeper and see what's actually being done. It's exceptionally easy to work with. Part of this is also because I don't have to jump all over the place to know what the code is doing. If I forgot what a helper does, I just glance a half page up and read the definition of the helper, and I know what's going on then. I don't have to keep a documentation page open to remind myself of the APIs, or the intentions or the like, because all call sites are visible, and the definition of the code is all there for me to see. A big part of this is relying on idioms rather than black box abstractions, patterns instead of information hiding.

2. Yes, I work with a number of other people who have varying levels of responsibility for the code. A good deal of them only care about the high-level concerns, and so they either use the compiler and don't read the code, or they make little tweaks here and their to suit their needs. They don't require a full understanding. A few others do work through the entire code base and we use the codebase as our single source of information when discussing design decisions and everything else. We don't have any need for supporting docs because it's too easy to work with the code to make other design docs useful when discussing architectural elements.

As for picking it up and contributing...that depends. There's a lot of background information you need to know. That's just the reality. The code itself isn't hard, but it is relying on a host of shared understanding. This includes mainly the Nanopass style of compiler design and construction, as well as a firm understanding of array programming, together with a firm understanding of the fundamental idiomatic algorithms being used in the compiler, which are the main "new thing" for people to understand after they understand how to build a compiler and how to write array programs.

Unfortunately, most schools teach neither Nanopass style compiler design, nor array programming. So, the completely uninitiated have a few things to learn. The code itself, though, is no worse than expecting someone to understand C semantics and OOP Tree oriented programming patterns in a monolithic compiler style when working with something like Clang, or tree traversal in C over C when working in a simple compiler like PCC.

3. Actually, looks can be deceiving. I'm doing GPU programming with APL, but over a domain that is considered far outside of the standard APL comfort zone. In fact, a key research contribution of this compiler is that I have developed algorithms and a holistic set of techniques for doing tree/compiler transformations on a GPU, which was previously a very open problem. This compiler/domain would have been considered one of the quintessential examples of a domain that is not well suited to APL. In fact, that's not true, it's just not suited to APL and the traditional way of thinking about this domain.

As for things like web development and the like, you can see MiServer, which is Dyalog's Open Source Web framework. There are some Dyalog User Meeting discussions about the framework, and tutorials, documentation, and the like to get you through it.

APL isn't a perfect language, but there are very few languages pushing the boundaries of human-computer interaction with language in the way that APL does, and I think this is sad.

I think in some ways we might be forced to write programs more like APL in the future, because parallelism is simply that important to the future of computing. I'm also doing research into the early stages of learning computer science, and some initial research suggests that it may in fact be better to teach students parallel programming in an APL style first, and later transition to the more traditional stuff, rather than doing it the opposite way, which so far as just convinced people that they shouldn't write parallel programs at all. In contrast, APLers prefer parallel programming and regular transform code from sequential to parallel and back again. They tend to be comfortable with both modes of thinking, which cannot be said for most other languages in common use. Agent/concurrency heavy languages might be an exception to this general rule, but they also get a bad rap.

Finally, I'd like to make a small note about bringing collaborators onboard. Many collaborators that I've had have been either inexperienced and were learning for some other reason or another, or were experienced and had one specific thing that they wanted to fix. What often happens is that the most inexperienced spend their time, as they should, learning important basics, and don't ever work on the compiler proper. I think this is the way it should be there. The more experienced people, I often find that I fix their problem that they wanted fixed in the compiler as I'm explaining the compiler to them. Because of the way the compiler is designed, I usually find myself saying, "So, to do this, you would have to make this change here [edit] and this here [edit] and ...oh wait, that's it, okay, it's done for you. Enjoy the patch!"

joe_the_user · on March 6, 2017

Thanks you very much for answering questions here.

I have a question about your "key operator" - it sounds similar to an idea I've played with - a way to run heterogeneous operations on GPUs kernel by running a very small bytecode interpreter on each kernel (Something similar to what's outlined in this here[1]).

Does that seem at all related to you?

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.321...

arcfide · on March 6, 2017

That's an interesting idea, and I can see there being some overlap, but the Key operator is actually not my invention, but is a member of a class of operators sometimes called classifiers. You can think about the GROUP BY operator in SQL. Or Henglein's work on discriminators. Key is an operator that is relatively new in Dyalog that came from J and some other APLs, I think. I have to go back to my paper to remember.

http://dl.acm.org/citation.cfm?id=2935331

joe_the_user · on March 6, 2017

I'm scanning the documents.

Basically, group nodes and map/operate on them.

It seem like that would be fairly simple if you are dynamically creating kernels.

-- Edit: Which implies you're doing a pretty "bare metal" interface to the GPU which still lets you do that and there's interesting story there, if I'd not totally off-base.

arcfide · on March 6, 2017

I try to avoid going bare metal as much as I can because of the difficulty involved with working with GPUs at that layer. I try to focus on higher level concerns, and use the excellent low level work of people such as the ArrayFire developers to handle the low level stuff. I still have to understand it and work with it at some level sometimes, but for the most part I stick to my core abstractions.

joe_the_user · on March 6, 2017

So ArrayFire allows the creation of kernels/the calling of functions in parallel/ on the fly?

arcfide · on March 7, 2017

In some limited, restricted cases, yes.

klettow · on March 6, 2017

http://code.jsoftware.com/wiki/Essays/Key#History

RodgerTheGreat · on March 6, 2017

If I follow correctly, the k6 equivalent would be:

    {x'y@=z}

"X of each of Y indexed by the group of Z." As in:

    key[+/;1 2 5;`x`z`z]

arcfide · on March 5, 2017

I believe the translation of |/0(0|+)\ is ⌈/0,(0⌈+)\

beagle3 · on March 6, 2017

I think this gives the correct answer, but the computation is different (which is fine - we're only looking at the answer); K over/scan go left-to-right, and if I'm not mistaken, APL and J go right to left, that is, in K:

    +/!5 equals ((((1+2)+3)+4)+5) which is not 1+2+3+4+5

whereas in APL/J it is

    +/iota5 equals (1+(2+(3+(4+5)))) which is 1+2+3+4+5

I find the APL approach mathematically consistent and elegant, but the K approach is more practical - it makes over and scan useful for state machines, parsing, and many other uses without having to reverse anything twice.

But often, as in this case (or the flatten case) it doesn't matter which direction you start folding from.

Avshalom · on March 5, 2017

K does a thing (like J I think?) where it's scan/reduce returns each subsequence/intermediate-value as it goes which is what makes that problem so trivial in K.

RodgerTheGreat · on March 5, 2017

The K "over" adverb (/) is like a conventional reduce/accumulate/fold. The K "scan" adverb (\) produces a list of the intermediate results. Having both is extremely handy.

If the verb argument of over/scan takes a single argument (rather than two), the adverbs will iterate to a fixed point. As before, "/" will return the final result and "\" returns all the intermediates.

For example:

arcfide · on March 5, 2017

Oh wait, so you're not preserving the subsequence, you just preserving the sum. So it's a standard scan. So my original translation is correct, my subsequence preserving version is doing a bit more than this by preserving the subarray as well as the sum.

arcfide · on March 5, 2017

Ah, if we have to preserve the sequence:

((⊃∘⍒0⌷⍉)⌷⊢)(+\,∘⍪,\)