Monday, June 22, 2015

[GSoC 2015 Week 4]

This week I'm working on an optimization of inplace_elemwise_optimizer. The idea is described here. In the current version, when inplace_elemwise_optimizer tries to replace the outputs of a node, the graph can become invalid, so validate() is called frequently to make sure the graph is not broken. But validate() is very time consuming, and the goal of this optimization is to make the optimizer more efficient by applying a new validation strategy.
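
To make the idea concrete, here is a rough sketch of the two strategies. It is not the actual patch: candidate_nodes and apply_inplace_on are made-up helper names, and the checkpoint/revert calls assume Theano's History feature is attached to the fgraph.

def old_strategy(fgraph):
    # Validate after every single replacement; a node whose replacement
    # breaks the graph is undone and simply skipped.
    for node in candidate_nodes(fgraph):
        chk = fgraph.checkpoint()
        apply_inplace_on(fgraph, node)
        try:
            fgraph.validate()      # expensive, called once per node
        except Exception:
            fgraph.revert(chk)     # failed nodes are ignored

def new_strategy(fgraph):
    # Apply a whole batch of replacements, then validate once.  If the
    # batch fails, fall back to per-node validation, which is where the
    # extra validate() calls can come from.
    chk = fgraph.checkpoint()
    for node in candidate_nodes(fgraph):
        apply_inplace_on(fgraph, node)
    try:
        fgraph.validate()          # called once for the whole batch
    except Exception:
        fgraph.revert(chk)
        old_strategy(fgraph)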

However, this optimization did not work as expected. The total optimization time became about 10 times larger:
370.541s for fgraph.validate()
...
1497.519540s - ('inplace_elemwise_optimizer', 'FromFunctionOptimizer', 33) - 315.055s
The original version:
72.644s for fgraph.validate()
...
143.351832s - ('inplace_elemwise_optimizer', 'FromFunctionOptimizer', 34) - 30.064s

After several small optimizations pointed out by my mentor, the time went down to 1178s.

Why is it slower? I think it is because we are now trying to apply the optimizer successfully on all nodes. There is a trade-off between the time taken by validate() and the number of nodes optimized. In the past, all failed nodes were ignored directly, so it was fast. Now we try to apply the optimizer to them again and again, so validate() is called many more times than before.

Here is a figure I just plotted to show the number of nodes to optimize in each iteration.

From this figure, we can see that although it is slower now, more nodes are optimized. A better balance should be found in this trade-off. Maybe making the iteration stop earlier is a good choice? Or maybe validate() itself can be optimized?

I'm still working on this. Please tell me if you have any ideas.

Thank you.

[GSoC 2015 Week 3]

Hi, this is my third weekly record for GSoC 2015.

What I did this week was to implement an optimization of the local_dimshuffle_lift optimizer. This optimizer does the following transformations:

DimShuffle(Elemwise(x, y)) => Elemwise(DimShuffle(x), DimShuffle(y))
DimShuffle(DimShuffle(x)) => DimShuffle(x)
This optimizer is a local optimizer, which means it will be called by global optimizers several times on different nodes. For example, here is a function graph:

DimShuffle{1,0} [@A] ''   
 |Elemwise{mul,no_inplace} [@B] ''   
   |DimShuffle{x,0} [@C] ''   
   | |Elemwise{add,no_inplace} [@D] ''   
   |   |<TensorType(float64, vector)> [@E]
   |   |DimShuffle{x} [@F] ''   
   |     |TensorConstant{42} [@G]
   |Elemwise{add,no_inplace} [@H] ''   
     |<TensorType(float64, matrix)> [@I]
     |DimShuffle{x,x} [@J] ''   
       |TensorConstant{84} [@K]
If we apply local_dimshuffle_lift to this graph, it can be applied at least 6 times. Looking carefully at how this optimization works, we find that applying it on node @A results in two new sub-nodes that it can be applied to again.

So the idea is to apply the optimizer recursively, so that the optimizer is invoked fewer times than before. An additional test case is also added.
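
Here is a minimal sketch of the recursive idea, handling only the DimShuffle-over-Elemwise case; the real local_dimshuffle_lift also merges nested DimShuffles and handles more corner cases, so this is just an illustration.

from theano.tensor.elemwise import DimShuffle, Elemwise

def dimshuffle_lift_sketch(node):
    # DimShuffle(Elemwise(x, y)) => Elemwise(DimShuffle(x), DimShuffle(y)),
    # recursing immediately into the freshly created DimShuffle nodes
    # instead of waiting for the global optimizer to revisit them.
    if not isinstance(node.op, DimShuffle):
        return False
    inner = node.inputs[0].owner
    if inner is None or not isinstance(inner.op, Elemwise):
        return False
    new_inputs = []
    for x in inner.inputs:
        ds = DimShuffle(x.broadcastable, node.op.new_order)(x)
        lifted = dimshuffle_lift_sketch(ds.owner)
        new_inputs.append(lifted[0] if lifted else ds)
    return [inner.op(*new_inputs)]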

I think perhaps the recursion could be replaced by an iterative approach? I'll do some experiments first to learn about the recursion's efficiency. Also, this optimization reminds me of another optimization on the to-do list, which will also be done recursively. Is there a possibility to extract this pattern out?

Thursday, June 11, 2015

[GSoC 2015 Week 2]

In week 2 I implemented one optimization for the Equilibrium Optimizer. The PR is here.

In this optimization, a "final optimization" step is added to the equilibrium optimization. Final optimizers are a list of global optimizers that are applied at the end of every equilibrium optimization pass. By making the right optimizers final ones, the number of optimization passes is expected to decrease.
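
Roughly, the equilibrium loop with the final optimizers looks like the sketch below. The names are illustrative and most of the real EquilibriumOptimizer's bookkeeping is omitted.

def equilibrium_optimize(fgraph, local_optimizers, final_optimizers,
                         max_passes=100):
    for _ in range(max_passes):
        changed = False
        for node in fgraph.toposort():
            if node not in fgraph.apply_nodes:   # already replaced this pass
                continue
            for opt in local_optimizers:
                replacements = opt.transform(node)
                if replacements:
                    fgraph.replace_all_validate(
                        list(zip(node.outputs, replacements)),
                        reason=str(opt))
                    changed = True
                    break
        # New: run the registered "final" global optimizers at the end of
        # every pass, so fewer passes are needed to reach equilibrium.
        for gopt in final_optimizers:
            gopt.apply(fgraph)
        if not changed:
            break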

Another change is to delete a node's function graph reference when pruning it from a function graph, so that the merge optimizer can easily tell whether a node belongs to a graph. This will be useful in other optimizers too.
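
With that change, a membership test can be as simple as the following sketch (the actual attribute handling in FunctionGraph is a bit more involved):

def in_fgraph(node, fgraph):
    # Once pruning clears node.fgraph, membership becomes a cheap attribute
    # check instead of a scan over fgraph.apply_nodes.
    return getattr(node, 'fgraph', None) is fgraph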

Next week, I'm going to work on the next two optimizations on the to-do list:

* Make local_dimshuffle_lift lift through many Elemwise ops at once
* Speed up inplace elemwise optimizer

Thanks. Any feedback is welcome!

Thursday, June 4, 2015

GSoC started! [GSoC 2015 Week 1]

Theano is a symbolic expression calculator. Users write expressions and functions in symbolic form and then compile them to make them executable, so that they can do real calculation with them: feed data in and get output.
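
For example, the usual workflow looks something like this:

import theano
import theano.tensor as T

x = T.dmatrix('x')                   # symbolic input
y = T.exp(x).sum()                   # symbolic expression
f = theano.function([x], y)          # compilation (and optimization) happens here
print(f([[0.0, 1.0], [2.0, 3.0]]))   # real calculation: feed data in, get output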

As I mentioned in the last post, my task here is to make Theano compile faster, so the very first step is profiling. During the community bonding period, I profiled a very simple autoencoder model and a more complicated model provided by my mentor. The first interesting thing I found was that the sum of the times in the profiling output is shorter than the time it actually takes. I plan to do some more profiling to find out the reason and to have it covered by Theano's profiling module, but not now.
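
For reference, Theano's built-in profiling can be enabled through its standard config flags, for example:

import theano
# Equivalent to running with THEANO_FLAGS="profile=True,profile_optimizer=True".
theano.config.profile = True            # profile function execution
theano.config.profile_optimizer = True  # also profile the graph optimizations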

In bigger models, optimization is the most time-consuming procedure during compilation. Optimizations are introduced to make the calculation more efficient by changing the function graph without side effects (http://deeplearning.net/software/theano/tutorial/symbolic_graphs.html#optimizations). Optimizations are useful, but it sometimes takes too much time to apply them.

So here is an optimizations to-do list. As I'm not very familiar with the compilation procedure yet, I didn't propose any optimizations, but I will when I can. The items listed now were added by my mentor and other developers, thank you.

This week I did the first optimization, which removes an unnecessary chain of operators, gpu_from_host and host_from_gpu. Theano can calculate on the GPU. During optimization, Theano tries to move operators and data onto the GPU, and gpu_from_host is introduced to do that. Things like gpu_from_host(host_from_gpu(x_whatever_already_in_gpu)) can be generated; although they can be replaced with x_whatever_already_in_gpu by the next applied optimizer, "gpu_cut_transfer", it takes time. Here I added some checks to make sure gpu_from_host(host_from_gpu(x)) is not introduced when moving data to the GPU. The pull request is here.
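
The redundant pattern itself corresponds to a local rewrite like the following sketch, using the GpuFromHost / HostFromGpu ops from the CUDA backend; this is only an illustration, not the actual change in the pull request.

from theano.sandbox.cuda.basic_ops import GpuFromHost, HostFromGpu

def cut_transfer_sketch(node):
    # gpu_from_host(host_from_gpu(x)) is just x, so return the variable
    # that already lives on the GPU instead of keeping the transfer chain.
    if isinstance(node.op, GpuFromHost):
        inner = node.inputs[0].owner
        if inner is not None and isinstance(inner.op, HostFromGpu):
            return [inner.inputs[0]]
    return False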

I'm thinking about a possible optimization similar to this one, but I still need more experience and experiments.

It was really interesting to read Theano's code.