Saturday, August 22, 2015

[GSoC 2015 Week 11 & 12]

I finally found out why the PR for the local_fill_sink optimization could not pass all test cases.

The failing test case is Test_local_elemwise_alloc. It tests local_elemwise_alloc, which replaces fill nodes with alloc nodes. The test checks whether the numbers of alloc and assert nodes in the optimized function graph are correct.

What caused the failure is a modification of theano.config.experimental.local_alloc_elemwise_assert in T_local_switch_sink from an earlier commit of mine (at the time I was trying to fix T_local_switch_sink's failure and set this option to False to prevent local_alloc_elemwise from creating assert nodes). Because the option stays False, local_elemwise_alloc cannot create new assert nodes in Test_local_elemwise_alloc --> the number of assert nodes is wrong --> failure.

The solution is quite easy: just delete that line and we are safe. What still confuses me is why the test results differed between my own computer and the Travis build server.
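
For the record, if a test really does need to flip that flag, saving and restoring it keeps the change from leaking into other tests. A minimal sketch, assuming the usual unittest setUp/tearDown pattern (not the actual test code):

import unittest
import theano

class T_local_switch_sink(unittest.TestCase):
    def setUp(self):
        # Remember the global flag and flip it only for this test class.
        self._old_assert_flag = \
            theano.config.experimental.local_alloc_elemwise_assert
        theano.config.experimental.local_alloc_elemwise_assert = False

    def tearDown(self):
        # Restore the flag so other tests (e.g. Test_local_elemwise_alloc)
        # still see the default value.
        theano.config.experimental.local_alloc_elemwise_assert = \
            self._old_assert_flag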

The next optimization may be to remove merge3, but that still needs to be discussed.

[GSoC 2015 Week 9 & 10]

In weeks 9 and 10, I finished making MergeOptimizer able to deal with nodes that have assert inputs.

The pull request is here. It can handle the following three cases.

1).

OP( x, y, z )
OP( assert(x, cond_x), y, z )
====
OP( assert(x, cond_x), y, z )

2).

OP( assert(x, cond_x), y, z )
OP( assert(x, cond_y), y, z )
====
OP( assert(x, cond_x, cond_y), y, z )

3).

OP( assert(x, cond_x), y, z )
OP( x, assert(y, cond_y), z )
====
OP( assert(x, cond_x), assert(y, cond_y), z )

Four new test cases were also created.
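
To make case 2 concrete, here is a small user-level illustration of the kind of graph the merge has to handle. This is a sketch using assert_op from theano.tensor.opt; the conditions are made up:

import theano
import theano.tensor as T
from theano.tensor.opt import assert_op

x = T.vector('x')
y = T.vector('y')
# Two outputs that differ only in the condition guarding x.
out1 = assert_op(x, T.gt(x.min(), 0)) + y
out2 = assert_op(x, T.lt(x.max(), 1)) + y
# After the extended MergeOptimizer runs, both outputs should share a single
# add node whose first input is assert(x, x.min() > 0, x.max() < 1).
f = theano.function([x, y], [out1, out2])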

Monday, July 27, 2015

[GSoC 2015 Week 7&8]

In weeks 7 and 8, I have mainly been working on the optimization of local_fill_sink. It is more complicated than I thought. For details, check the discussion here.

With the code of this PR, the time spent in "canonicalize" is lower than with the original code (12 passes in 166 seconds --> 10 passes in 155 seconds, tested with the user's case on my computer).

But these changes make Theano fail on one test case, "test_local_mul_switch_sink", which checks that the optimizer "local_mul_switch_sink" behaves correctly. Why does it fail? In short, this test builds an fgraph like "(* (switch ...) (switch ...))"; if the optimizer is applied correctly, the mul op sinks under the switch op, so expressions like "(* value_x NaN)" are avoided and the result comes out right.
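
A small illustration of the pattern the optimizer protects against (made-up expressions, not the actual test code):

import numpy
import theano
import theano.tensor as T

x = T.dvector('x')
# One switch branch is exactly 0; the other factor is NaN exactly where
# that 0 branch is selected (log of a negative number).
y = T.switch(T.lt(x, 0), 0., x) * T.log(x)
f = theano.function([x], y)
# Without the sink, the graph computes roughly 0 * NaN = NaN for negative
# inputs. With local_mul_switch_sink it becomes switch(x < 0, 0, x * log(x)),
# so the NaN in the untaken branch never reaches the result: with the
# optimization applied, the first element below should come out as 0.
print(f(numpy.array([-1., 2.])))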

What stops the optimizer is an assert node inserted into the graph. What I am working on now is making MergeOptimizer deal with nodes that have assert inputs, which is of course already a separate optimization. For the failing test case, another option would be to modify local_mul_switch_sink so that it can apply through assert nodes, but that is not a good solution because it is not general.

Please reply here or send me an email if you have any ideas or comments. Thanks very much.


[GSoC 2015 Week 5&6]

In weeks 5 and 6, I worked on a new debugging feature: displaying the names of rejected optimizers when doing "replace_and_validate()". The PR is here.

To implement this feature in a single place in the code, the Python library "inspect" is used. Whenever validation fails, the code inspects the current stack frames to find out which optimizer is the caller and whether a "verbose" flag is set, and then displays the debugging information.
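
The mechanism is roughly the following. This is only a sketch, not the exact PR code; the local variable names 'lopt' and 'verbose' are assumptions about what lives in the caller's frame:

import inspect

def report_rejected_optimizer():
    # Walk up the stack from the failed validation and look for the frame
    # of the optimizer that requested the replacement.
    for frame_info in inspect.stack():
        frame = frame_info[0]
        lopt = frame.f_locals.get('lopt')
        if lopt is not None:
            verbose = frame.f_locals.get('verbose', False)
            if verbose:
                print('validation failed, rejected optimizer: %s'
                      % getattr(lopt, 'name', lopt))
            return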

Besides that, work on optimizing the local_fill_sink optimizer also began here; the main idea is to make "fill" sink recursively so that the optimizer is more efficient.
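
The rewrite itself is f(fill(a, b), c) -> fill(a, f(b, c)) for an elemwise f; applying it recursively lets a whole chain sink in one go instead of one level per equilibrium pass. A toy, self-contained sketch of the idea on nested tuples (not Theano code; only the first fill input of each node is handled):

def fill_sink(expr):
    # expr is either a leaf (a string) or a tuple (op, *inputs),
    # where op may be the special 'fill'.
    if not isinstance(expr, tuple):
        return expr
    op, args = expr[0], [fill_sink(a) for a in expr[1:]]
    if op != 'fill':
        for i, a in enumerate(args):
            if isinstance(a, tuple) and a[0] == 'fill':
                shape_src, value = a[1], a[2]
                args[i] = value
                # Sink this op under the fill and keep sinking recursively.
                return ('fill', shape_src, fill_sink((op,) + tuple(args)))
    return (op,) + tuple(args)

print(fill_sink(('add', ('mul', ('fill', 'a', 'b'), 'c'), 'd')))
# -> ('fill', 'a', ('add', ('mul', 'b', 'c'), 'd'))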

As for the inplace_elemwise optimizer, it was not merged because of its poor performance.

Monday, June 22, 2015

[GSoC 2015 Week 4]

This week I am working on an optimization of inplace_elemwise_optimizer. The idea is described here. In the current version, when inplace_elemwise_optimizer tries to replace the outputs of a node, the graph can become invalid, so validate() is called frequently to make sure the graph stays unbroken. But validate() is very time consuming, and the goal of this optimization is to make the optimizer more efficient by applying a new validation strategy.
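
The strategy is, roughly, to validate less often: apply a whole batch of candidate replacements, validate once, and only fall back to per-replacement validation when the batch fails. A sketch of that idea (a hypothetical helper, not the actual PR code; it assumes an fgraph with the History and ReplaceValidate features attached, i.e. checkpoint/revert/replace/validate):

def replace_all_validate_chunked(fgraph, replacements, chunk_size=32):
    applied = 0
    for start in range(0, len(replacements), chunk_size):
        chunk = replacements[start:start + chunk_size]
        chk = fgraph.checkpoint()
        try:
            for old, new in chunk:
                fgraph.replace(old, new)
            fgraph.validate()
            applied += len(chunk)
        except Exception:
            # The whole chunk is suspect: roll back and retry one by one,
            # so a single bad replacement does not discard the others.
            fgraph.revert(chk)
            for old, new in chunk:
                try:
                    fgraph.replace_validate(old, new)
                    applied += 1
                except Exception:
                    pass
    return applied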

However, this optimization did not work as expected. The total optimization time became about 10 times larger:
370.541s for fgraph.validate()
...
1497.519540s - ('inplace_elemwise_optimizer', 'FromFunctionOptimizer', 33) - 315.055s
The original version:
72.644s for fgraph.validate()
...
143.351832s - ('inplace_elemwise_optimizer', 'FromFunctionOptimizer', 34) - 30.064s

After several small optimizations pointed out by my mentor, the time came down to 1178s.

Why is it slower? I think it is because we are trying to apply the optimizer successfully on all nodes. It is a trade-off between the time taken by validate() and the number of nodes optimized. In the past, all failed nodes were simply ignored, so it was fast. Now we try to apply the optimizer to them again and again, so validate() is called many more times than before.

Here is a figure I just plotted showing the number of nodes to optimize in each iteration.

From this figure we can see that although it is slower now, more nodes get optimized. A better balance should be struck in this trade-off; maybe stopping the iteration earlier is a good choice? Or maybe validate() itself can be optimized?

I'm still working on this. Please tell me if you have any idea.

Thank you.

[GSoC 2015 Week 3]

Hi, this is my third weekly record post for GSoC 2015.

What I did this week was implement an optimization of the local_dimshuffle_lift optimizer. This optimizer performs the following transformations:

DimShuffle(Elemwise(x, y)) => Elemwise(DimShuffle(x), DimShuffle(y))
DimShuffle(DimShuffle(x)) => DimShuffle(x)
This optimizer is a local optimizer, which means it is called by global optimizers several times on different nodes. For example, here is a function graph:

DimShuffle{1,0} [@A] ''   
 |Elemwise{mul,no_inplace} [@B] ''   
   |DimShuffle{x,0} [@C] ''   
   | |Elemwise{add,no_inplace} [@D] ''   
   |   |<TensorType(float64, vector)> [@E]
   |   |DimShuffle{x} [@F] ''   
   |     |TensorConstant{42} [@G]
   |Elemwise{add,no_inplace} [@H] ''   
     |<TensorType(float64, matrix)> [@I]
     |DimShuffle{x,x} [@J] ''   
       |TensorConstant{84} [@K]
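
A graph like this can be produced by code roughly along these lines (the exact debugprint output may differ slightly):

import theano
import theano.tensor as T

v = T.dvector()
m = T.dmatrix()
# (v + 42) is broadcast against (m + 84), which inserts DimShuffle{x,0};
# the final transpose is the outer DimShuffle{1,0}.
out = ((v + 42) * (m + 84)).T
theano.printing.debugprint(out)
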
If we apply local_dimshuffle_lift to this graph, it gets applied at least 6 times. Looking carefully at how the optimization works, we find that applying it to node @A produces two new subnodes to which it can be applied again.

So the idea is to apply the optimizer recursively, so that it is applied fewer times than before. An additional test case was also added.

Perhaps the recursion could be replaced by an iterative approach? I will do some experiments first to measure the recursion's efficiency. Also, this optimization reminds me of another optimization on the to-do list, which will also be done recursively. Is there a possibility of extracting this pattern out?

Thursday, June 11, 2015

[GSoC 2015 Week 2]

In week 2, I implemented an optimization of the equilibrium optimizer. The PR is here.

In this optimization, a "final optimization" step is added to the equilibrium optimization. The final optimizers are a list of global optimizers that are applied at the end of every equilibrium optimization pass. By making the right optimizers final ones, the number of optimization passes is expected to decrease.
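
Schematically, the change amounts to something like this (a conceptual sketch, not the actual EquilibriumOptimizer code; it assumes each optimizer reports whether it changed the graph):

def equilibrium_optimize(fgraph, local_optimizers, final_optimizers,
                         max_passes=8):
    for _ in range(max_passes):
        changed = False
        for opt in local_optimizers:
            changed |= bool(opt.apply(fgraph))
        # New step: the final optimizers run once at the end of every pass,
        # so their work does not have to wait for yet another full pass.
        for opt in final_optimizers:
            changed |= bool(opt.apply(fgraph))
        if not changed:
            break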

Another change is to delete a node's function graph reference when pruning it from a function graph, so that the merge optimizer can easily tell whether a node belongs to a graph. This will be useful in other optimizers too.
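
The membership test this enables is essentially an attribute check. A sketch only; the surrounding pruning logic is omitted and the details differ from the actual code:

def prune_node(fgraph, node):
    # ... detach node from fgraph.apply_nodes, client lists, etc. ...
    del node.fgraph          # drop the back-reference to the graph

def node_in_graph(node, fgraph):
    return getattr(node, 'fgraph', None) is fgraph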

Next week, I am going to work on the next two optimizations on the to-do list:

* Make local_dimshuffle_lift lift through many elemwise ops at once
* Speed up the inplace elemwise optimizer

Thanks. Any feedback is welcome!