Finally I found what caused the PR of local_fill_sink optimization can not pass all test cases.
The failed test case is Test_local_elemwise_alloc. It tests local_elemwise_alloc, which replace fill node with alloc. The test case checks if the numbers of alloc and assert are right in the optimized function graph.
What cause its failure is the modification of theano.config.experimental.local_alloc_elemwise_assert in T_local_switch_sink (I was trying to fix T_local_switch_sink's failure at that time and tried to turn this option to be False to prevent local_alloc_elemwise to create assert nodes) in my commit before. The option is set to be False there, so in Test_local_elemwise_alloc, local_elemwise_alloc cannot create new assert nodes --> the number of assert nodes is wrong --> failure.
The solution is quite easy, just delete that line then we are safe. But what confused me is how can the test result be different between my own computer and the travis building server.
The next optimization maybe to remove merge3, but it need to be discussed.
2015年8月22日星期六
[GSoC 2015 Week 9 & 10]
In week 9 and week 10, I finished making MergeOptimizer be able to deal with nodes with assert inputs.
The Pull Request is here. It can handle following 3 cases.
1).
OP( x, y, z )
OP( assert(x, cond_x), y, z )
====
OP( assert(x, cond_x), y, z )
2).
OP( assert(x, cond_x), y, z )
OP( assert(x, cond_y), y, z )
====
OP( assert(x, cond_x, cond_y), y, z )
3).
OP( assert(x, cond_x), y, z )
OP( x, assert(y, cond_y), z )
====
OP( assert(x, cond_x), assert(y, cond_y), z )
Fore new test cases were also created.
The Pull Request is here. It can handle following 3 cases.
1).
OP( x, y, z )
OP( assert(x, cond_x), y, z )
====
OP( assert(x, cond_x), y, z )
2).
OP( assert(x, cond_x), y, z )
OP( assert(x, cond_y), y, z )
====
OP( assert(x, cond_x, cond_y), y, z )
3).
OP( assert(x, cond_x), y, z )
OP( x, assert(y, cond_y), z )
====
OP( assert(x, cond_x), assert(y, cond_y), z )
Fore new test cases were also created.
2015年7月27日星期一
[GSoC 2015 Week 7&8]
In Week 7 and Week 8, I am mainly working on the optimization of local_fill_sink. This is more complicated than I thought. For details, check discussions here.
When the code of this PR is being used, the time costed by "canonicalize" is less than the original code. (12 passes in 166 seconds --> 10 passes in 155 seconds, tested with the user case on my computer).
But these changes make theano fails on a test case, "test_local_mul_switch_sink". This test case is to check if the optimizer "local_mul_switch_sink" behaves correctly. Why does it fail? For short, in this test there is a fgraph like " (* (switch ...) (switch ...) )", if this optimizer is applied correctly, the mul op will sink under the switch op, so that expression like "(* value_x Nan)" can be avoided and end up with right result.
What stops the optimizer is the assert node inserted into the graph. What I am working on now is to make MergeOptimizer deal with nodes with assert inputs. Of course this is already another optimization. For the failed test case, one way is to modify the code of local_mul_switch_sink, make it able to applied through assert nodes, but this is not a good way because it is not general.
Please reply here or send me email if you have any idea or comments. Thanks very much.
[GSoC 2015 Week 5&6]
In the Week 5 and Week 6, I was working on a new feature for debugging, to display the names of rejected optimizers when doing "replace_and_validate()". The PR is here.
To implement this feature in one place in the code, the python library "inspect" is used. Anytime validation fails, the code will inspect the current stack frames and get to know which optimizer is the caller and whether there is "verbose" flag. Then the information for debugging can be displayed.
Besides, optimization for the local_fill_sink optimizer is also began here, the main idea on local_fill_sink is to make "fill" to sink in a recursive way to be more efficiency.
For the inplace_elemwise optimizer, it is not merged because of the bad performance.
2015年6月22日星期一
[GSoC 2015 Week 4]
This week I'm working with an optimization of inplace_elemwise_optimizer. The idea is described here. In the current version, when inplace_elemwise_optimizer trying to replace outputs of a node, the graph can become invalid, therefore validate() is called frequently to make sure the graph unbroken. But validate() is very time consuming, and the goal of this optimization is to make this optimizer more efficient by applying new validating strategy.
However, this optimization did not work as expected. The total optimization time become 10 times larger:
After several small optimization pointed out by my mentor, the time become 1178s.
Why is it slower? I think it is because we are trying to apply the optimizer successfully on all nodes. It is a trade-off between the time took by validate() and the number of nodes optimized. In the past, all failed nodes are ignored directly, so it was fast. Now we are trying to apply on them again and again. validate() is called for much more times than before.
Here is a figure I just plotted to display the nodes number to optimize in each iteration.
In this figure, we know that although it is slower now, there is more nodes are optimized. A better balance should be taken in the trade-off, maybe to make the iteration stop earlier is a good choice? Or maybe the validate() can be optimized?
I'm still working on this. Please tell me if you have any idea.
Thank you.
However, this optimization did not work as expected. The total optimization time become 10 times larger:
370.541s for fgraph.validate()
...
1497.519540s - ('inplace_elemwise_optimizer', 'FromFunctionOptimizer', 33) - 315.055s
The origin version:72.644s for fgraph.validate()
...
143.351832s - ('inplace_elemwise_optimizer', 'FromFunctionOptimizer', 34) - 30.064s
After several small optimization pointed out by my mentor, the time become 1178s.
Why is it slower? I think it is because we are trying to apply the optimizer successfully on all nodes. It is a trade-off between the time took by validate() and the number of nodes optimized. In the past, all failed nodes are ignored directly, so it was fast. Now we are trying to apply on them again and again. validate() is called for much more times than before.
Here is a figure I just plotted to display the nodes number to optimize in each iteration.
I'm still working on this. Please tell me if you have any idea.
Thank you.
[GSoC 2015 Week 3]
Hi, this is my third post of weekly record for GSoC 2015.
What I did this week is to implement an optimization on local_dimshuffle_lift optimizer. This optimizer does following transfers,
DimShuffle(Elemwise(x, y)) => Elemwise(DimShuffle(x), DimShuffle(y))
DimShuffle(DimShuffle(x)) => DimShuffle(x)
This optimizer is a local optimizer, which means it will be called by global optimizers several times on different nodes. For example, here is a function graph,
I think perhaps doing things recursively can be replaced by a iterative way? I'll do some experiments first to know recursion's efficiency. Also, this optimization makes me recall another optimization on the to-do list, which will also be done recursively. Is there possibility to extract this pattern out?
DimShuffle{1,0} [@A] ''
|Elemwise{mul,no_inplace} [@B] ''
|DimShuffle{x,0} [@C] ''
| |Elemwise{add,no_inplace} [@D] ''
| |<TensorType(float64, vector)> [@E]
| |DimShuffle{x} [@F] ''
| |TensorConstant{42} [@G]
|Elemwise{add,no_inplace} [@H] ''
|<TensorType(float64, matrix)> [@I]
|DimShuffle{x,x} [@J] ''
|TensorConstant{84} [@K]
If we apply local_dimshuffle_lift on this graph, it can be applied for at least 6 times, looking carefully at how this optimization works we will find that an optimization applied on node @A results in two new subnodes that can be applied on again.
So the idea is to recursively apply the optimizer. So that the optimizer will applied less times than before. An additional test case is also added.
I think perhaps doing things recursively can be replaced by a iterative way? I'll do some experiments first to know recursion's efficiency. Also, this optimization makes me recall another optimization on the to-do list, which will also be done recursively. Is there possibility to extract this pattern out?
2015年6月11日星期四
[GSoC 2015 Week 2]
In the week 2 I implement one optimization to the Equilibrium Optimizer. The PR is here.
In this optimization, an "final optimization" procedure is added to the equilibrium optimization. Final optimizers is a list of global optimizers, and will be applied at the end of every equilibrium optimization pass. The number of optimization pass is expected to decrease, by making right optimizers final ones.
Another change is to delete a node's function graph reference when pruning it from a function graph. So merge optimizer can easily tell whether a node belongs to a graph. It will be useful in other optimizers too.
In the next week, the next 2 optimizations in the to-do list are what I'm going to do:
* Make local_dimshuffle_list lift through many elemwise at once
* Speed up inplace elemwise optimizer
Thanks. Any feedback is welcome!
In this optimization, an "final optimization" procedure is added to the equilibrium optimization. Final optimizers is a list of global optimizers, and will be applied at the end of every equilibrium optimization pass. The number of optimization pass is expected to decrease, by making right optimizers final ones.
Another change is to delete a node's function graph reference when pruning it from a function graph. So merge optimizer can easily tell whether a node belongs to a graph. It will be useful in other optimizers too.
In the next week, the next 2 optimizations in the to-do list are what I'm going to do:
* Make local_dimshuffle_list lift through many elemwise at once
* Speed up inplace elemwise optimizer
Thanks. Any feedback is welcome!
2015年6月4日星期四
GSoC started! [GSoC 2015 Week 1]
Theano is a symbolic expression calculator. User write expressions and functions in symbolic form, and then compile them to make them executable, so that user can do real calculation with them, feed data in and get output.
As I mentioned in last post, my task here is to make theano compile faster. So the very first step is profiling. During community bonding period, I profiled an autoencoder model which is very simple and a more complicated model provided by my mentor. The first interesting thing I found was that the sum of time in profiling output is shorter than the time it actually takes. I plan to do some more profiling to find out the reason and to make it covered by theano profiling module, but not now.
In bigger model, optimization is the most time-cost procedure during compile. Optimizations are introduced to make calculation more efficient, by changing the function graph without side effect. (http://deeplearning.net/software/theano/tutorial/symbolic_graphs.html#optimizations) Optimizations are useful, but it sometimes take too much time to apply them.
So here is a optimizations' to-do list. As I'm not very familiar with compiling procedure yet, I didn't purpose any optimization. But I will do that when I can. The items listed now are added by my mentor and other developers, thank you.
This week I did the first optimization, which is removing unnecessary operator chain, gpu_from_host and host_from_gpu. Theano can calculate with gpu. When doing optimization, theano tries to move operators and data into gpu, gpu_from_host is introduced to do that. Things like gpu_from_host(host_from_gpu(x_whatever_already_in_gpu)) can be generated, although they can be replaced with x_whatever_already_in_gpu in next applied optimizer "gpu_cut_transfer", it take time. Here I do some check to make sure gpu_from_host(host_from_gpu(x)) not introduced when moving data to gpu. The pull request is here.
I'm thinking about a possible optimization to do similar to this one, but still need more experience and experiments.
It was really interesting to read theano's code.
As I mentioned in last post, my task here is to make theano compile faster. So the very first step is profiling. During community bonding period, I profiled an autoencoder model which is very simple and a more complicated model provided by my mentor. The first interesting thing I found was that the sum of time in profiling output is shorter than the time it actually takes. I plan to do some more profiling to find out the reason and to make it covered by theano profiling module, but not now.
In bigger model, optimization is the most time-cost procedure during compile. Optimizations are introduced to make calculation more efficient, by changing the function graph without side effect. (http://deeplearning.net/software/theano/tutorial/symbolic_graphs.html#optimizations) Optimizations are useful, but it sometimes take too much time to apply them.
So here is a optimizations' to-do list. As I'm not very familiar with compiling procedure yet, I didn't purpose any optimization. But I will do that when I can. The items listed now are added by my mentor and other developers, thank you.
This week I did the first optimization, which is removing unnecessary operator chain, gpu_from_host and host_from_gpu. Theano can calculate with gpu. When doing optimization, theano tries to move operators and data into gpu, gpu_from_host is introduced to do that. Things like gpu_from_host(host_from_gpu(x_whatever_already_in_gpu)) can be generated, although they can be replaced with x_whatever_already_in_gpu in next applied optimizer "gpu_cut_transfer", it take time. Here I do some check to make sure gpu_from_host(host_from_gpu(x)) not introduced when moving data to gpu. The pull request is here.
I'm thinking about a possible optimization to do similar to this one, but still need more experience and experiments.
It was really interesting to read theano's code.
2015年5月24日星期日
Theano: proposal changed
Hi there, I'm Ziye, I'm taking part in this year's GSoC. I apply for the Theano project because I use it in my own lab researching work (Btw, I do some music information retrieval work in lab).
The original proposal is to decrease the python overhead generated by theano. The new one is to decrease the compiling time. It will improve the theano performance. During the community bonding period, with help of my mentor Fred, the optimization objective is almost confirmed.
In the upcoming coding period. I'll put more time on it. It will be interesting.
Ziye
The original proposal is to decrease the python overhead generated by theano. The new one is to decrease the compiling time. It will improve the theano performance. During the community bonding period, with help of my mentor Fred, the optimization objective is almost confirmed.
In the upcoming coding period. I'll put more time on it. It will be interesting.
Ziye
2015年3月27日星期五
订阅:
博文 (Atom)