Combinatorial optimization problems have recently emerged in the design of controllers for OLED displays. The objective is to decompose an image into subframes minimizing the addressing time and thereby also the amplitude of the electrical current through the diodes, which has a direct impact on the lifetime of such a display. To this end, we model this problem as an integer linear program. Subsequently, we refine this formulation by exploiting the combinatorial structure of the problem. We propose a fully combinatorial separation routine for the LP-relaxation based on matching techniques. It can be used as an oracle in various frameworks to derive approximation algorithms or heuristics. We establish NP-hardness and hardness of approximation. Nevertheless, we are able to work around this issue by only focusing on a subsets of the variables and provide experimental evidence that they are sufficient to come up with near optimal solutions in practice. On this basis, one can derive custom-tailored solutions adapting to technical constraints such as memory requirements. By allowing the addressing of distributed doublelines, we improve the addressing time in cases where previous approaches fall short due to their restriction to consecutive doublelines. ⋆ Supported by Deutsche Forschungsgemeinschaft (DFG) within Priority Programme 1307 "Algorithm Engineering"
Introduction
Organic Light Emitting Diodes (OLEDs) have been a hot topic on the display market in the last years as the sizes of commercially available displays increased significantly. Moreover, they provide many advantages over current technology, such as Liquid Crystal Display (LCD).The image and video displayed has a very high contrast and a viewing angle of nearly 180 degrees. It reacts within 10 microseconds, which is much faster than the human eye can catch and is therefore well suited for video applications. Moreover, the display is physically flexible.
There are two different OLED technologies called active matrix (AM) and passive matrix (PM). The former is more expensive but offers a longer lifetime than the latter. Their limited lifetime is one major reason why there are only small-sized displays on the mass market. For mobile phones or digital cameras, large state of the art OLED displays are either too expensive or suffer from insufficient lifetime.
While a lot of research is conducted on the material science side, the so-called Consecutive Multiline Addressing Scheme for passive matrix OLED displays [1] tackles the lifetime-problem from an algorithmic point of view. It is based on the fact that equal rows can be displayed simultaneously with a lower electrical current than in a serial manner [2, 3] . Here we restrict ourselves to an informal description for selfcontainment.
Fig. 1. Schematic electrical circuit of a display
A (passive matrix) OLED display has a matrix structure with n rows and m columns as depicted in Figure 1 . At any crossover between a row and a column there is a vertical diode which works as a pixel. The image itself is given as an integral non-negative n × m matrix. For the sake of simplicity, we first consider the case of binary matrices, i.e. black/white images, and generalize to greyscale and colored images later on.
Consider the contacts for the rows and columns as switches. For the time the switch of row i and column j is closed, an electrical current flows through the diode of pixel (i, j) and it shines. Hence, we can not control each pixel directly. Therefore, such passive matrix displays are traditionally driven row by row in a round-robin fashion. At a sufficiently high framerate, say 50 Hz, the human eye perceives for each pixel only the average over time. Since, we may skip rows that are completely dark, the addressing time varies from to image to image. To maintain the brightness at the same level, we lower the amplitude of electrical current that is sourced into the columns. This procedure has two desired side effects: The power consumption is reduced and their lifetime is extended since high amplitudes of the electrical current are the major issues with respect to the lifetime of the diodes [4] . We can even save more time per frame, if we drive two rows simultaneously. However, this only works, if their content is equal as in the following example. As one can see, we need 5 units of time to display the image in the traditional way, i.e. row by row, whereas 3 units of time are sufficient with so-called Distributed Doubleline Addressing (DDA).
We thereby gain 40% of the time to display this image and as said before we may decrease the amplitude of the electrical current by that amount. Hence, it remains to find an algorithm that computes such a decomposition to benefit from Distributed Doubleline Addressing (DDA). We use the term distributed to distinguish from previous work where it is only allowed to combine consecutive lines. That is, we have to display the first matrix on the right-hand side of our example in two steps, which therefore only permits a reduction of the electrical current by 20% in that case.
To benefit from this decomposition in practice, we should adhere to the following design criteria. Since the algorithm has to be implemented on a driver chip attached to the display, it must have low hardware complexity allowing small production costs. Consequently it has to rely only on a small amount of memory and it should be fully combinatorial, i.e. only additions, subtractions, and comparisons are used. Though it has to solve or approximate the optimization problem, which is formally described in Section 2, in real-time. We do not fulfill all these requirements in one shot. We rather apply an algorithm engineering process that approaches these goals in several iterations.
Previous Work
Algorithmic questions on the restriction to Consecutive Doubleline Addressing (CDA) have been discussed by Eisenbrand et al. [2, 5] . In these papers, the authors also considered to combine more than two lines simultaneously, but only consecutive ones. Other approaches based on Non-negative Matrix Factorization [6, 7] have been outlined by Smith et al. [8] and Smith [9] .
Contributions of this paper
We describe an algorithm engineering process to develop efficient solutions for a realworld problem. That is, we first model the matrix decomposition problem of Distributed Doubleline Addressing as an integer program. On this basis, we improve the formulation by exploiting its combinatorial structure until we achieve a solution which is applicable in practice. On the theory side, we prove that computing optimal decompositions is NP-complete and also hard to approximate within a certain constant factor. To this end, we introduce the Matchable Subset Problem as a special case of our real-world problem. Though the complexity results sound discouraging, they give useful insight into the structure of the problem and also hints to the applicable methods. That is, we adopt approximation techniques such as LP-rounding to come up with a promising method in practice. We derive two LP formulations of our problem: A concise one and one with exponentially many constraints. Though the former is of polynomial size, it is impractical and inferior to the latter. This interesting and on the first glance counterintuitive behavior is due to the fact that we apply techniques known from b-matching to develop an efficient fully combinatorial algorithm for the separation problem of the exponentially many constraints. Finally, we propose parameterized heuristics to achieve a solution, which is applicable in practice. We conclude with a presentation of some computational results showing the improvement with respect to previous work. We thereby show that methods from combinatorial optimization are well-suited to tackle algorithmic challenges in the design of flat panel display drivers.
A Linear Programming Formulation
In this section, we will briefly introduce a linear programming model for DDA. The interested reader is referred to [2] for a more profound elaboration on the technical details. For the sake of simplicity, we restrict ourselves to the special case of black/white images given by binary matrices R = (r ij ) ∈ {0, 1} n×m for time being. Let the binary variables f j (i, k) ∈ {0, 1} denote whether the switch for column j is closed while the switches of the rows i and k are closed. Note that if i equals k, then the corresponding variables represent a single line. Moreover, f j (i, k) and f j (k, i) represent the same switches and hence we implicitly require f j (i, k) = f j (k, i) in the following. To get a lossless decomposition of the image R, the following constraints must hold.
Recall that our objective is to minimize the addressing time for each given image. Clearly, if we have for some pair {i, k} that f j (i, k) = 0 for all j ∈ {1, . . . , m}, we can skip this doubleline (or singleline if i equals k). Hence, the total number of subframes is given by
We apply the standard trick to derive a linear programming formulation by replacing each maximum by an auxiliary variable u(i, k). This yields
This LP formulation is not integral in general, which can be verified by an example with three rows and a single column with R = (1, 1, 1) T . The fractional optimum of
Experimental Evaluation
This basic LP formulation permits a first evaluation of the DDA approach using standard software. Though it is clear that we will not be able to implement a general purpose LP-solver on a chip that drives such a display, we can use the LP-solutions as a benchmark for our further algorithms and heuristics. Although the formulation (1) is concise in the theoretical sense as only a polynomial number (with respect to the input size) of variables and constraints are used, the programs get huge for typical OLED displays.
To give the reader a figure, we present the numbers for QQVGA resolution, e.g. subdisplays for mobile phones. There, we have n = 120 rows and m = 3 · 160 = 480 columns. This means that we have (m + 1)n(n + 1)/2 ≈ 3.5 · 10 6 variables and about the same number of constraints in our LP. Hence, it comes to no surprise that CPLEX 10.0 takes for a much smaller LP of 30 rows already about 4 minutes on a 2.8 GHz Dualcore AMD Opteron with 16 GB RAM and the run for QQVGA did not finish within 300 hours. Clearly, we must have a deeper look at the theoretical properties of our problem to make any progress. To this end, we refine the formulation of our problem by exploiting its combinatorial structure.
Combinatorial Refinement
A closer look at the LP formulation (1) reveals that the objective only depends on the uvariables. Moreover, if those variables were fixed, then the problem would decompose into m independent parts. Hence, we wish to have an efficient method to solve the separation problem for the u-variables. That is, given an assignment to the u-variables, to decide whether all the independent parts are feasible, and if not, to return a violated inequality.
To this end, we introduce a combinatorial formulation of our problem. It is straightforward to consider an undirected graph G = (V, E) where each vertex i ∈ V corresponds to a row of the display and the edgeset E represents the pairs of row-switches. Note that we allow self-loops in G to model the singlelines. If no further restrictions are given, then G is the complete graph on n nodes.
In the following, we consider the column vectors of an image R as functions r j : V → Z ≥0 . A lossless decomposition is then considered as a perfect r j -matching problem for each column j = 1, . . . , m. That is, the set of feasible timings for column j is given by the polyhedron
where δ(·) denotes the set of incident edges and f (δ(·)) means the sum over variables of these edges.
Recall that the timings for the row-switches is determined by the maxima over all columns. That is, we have a variable u(e) for each edge e ∈ E. A row-timing u : E → Z ≥0 is feasible, if and only if for each column j ∈ {1, . . . , m} there is a feasible matching f j ∈ P j with f j ≤ u. Hence, a row-timing u is feasible for a column j if and only if it is contained in the up-hull of P j , i.e. u ∈ P ↑ j := P j + R |E| ≥0 . Thus, the set of 6 Andreas Karrenbauer feasible row-timings is given by the polyhedron
The problem can now be divided into two parts and understood as follows.
1. Find a row-timing u ∈ P that minimizes the sum u(E).
2. For each column j = 1, . . . , m, compute a u-capacitated perfect r j -matching f j representing the timings for the corresponding column switches.
Note that in the second step, the columns become independent and the matching problems could be solved in parallel. Moreover, there are several combinatorial algorithms known from literature to solve this task. It remains to find a good characterization of P , e.g. by an efficient combinatorial algorithm for the separation problem. That is, given a vector u ′ determine an inequality that is valid for all elements of P but violated for u ′ , or assert that no such inequality exists and hence u ′ is contained in P .
denotes the inner edges of X excluding the self-loops, δ(·) also contains the self-loops of X, and N (X) ⊂ V is the neighborhood of X in G.
We sketch the proof in the following, since it reveals our implementation of the separation routine for these inequalities. We show first that each of the Inequalities (2) is valid for P . To this end, arbitrarily fix u ∈ P , X ⊆ V , Y ⊂ N (X), and j ∈ {1, . . . , m}. Since u ∈ P , there is a f ∈ P j with f ≤ u. Hence,
To prove sufficiency, we transform the problem to the uncapacitated case. To this end, we split each edge e = {i, k} ∈ E by two new nodes e 1 , e 2 such that we get a new graph G ′ = (V ′ , E ′ ) with V ′ := V∪ {e 1 , e 2 : e ∈ E} E ′ := {i, e 1 }, {e 1 , e 2 }, {e 2 , k} : e = {i, k} ∈ E .
Note that the self-loops transform into 3-cycles as depicted in Figure 2 .
Consider j to be fixed for time being. We define b : V ′ → Z ≥0 with b(i) := r j (i) for all i ∈ V , b(e 1 ) := b(e 2 ) := u(e) for all e ∈ E that are no self-loops, and b(e 1 ) := b(e 2 ) := u(e)/2 for all self-loops e. It is easy to verify that a fractional perfect b-matching in G ′ corresponds to a fractional u-capacitated perfect r j -matching in G and vice versa. The value attributed to the middle segment in G ′ determines the slack of the corresponding edge in G. The transformation to the uncapacitated case allows us to use a well-known characterization of the existence of perfect 2b-matchings (cf. Corollary 31.5a in [10] ) that we can state in our context as follows. 
It remains to show that it is sufficient to consider only Y ⊆ N (X) in the original graph. However, this is easy to see since any y ∈ Y \ N (X) would weaken the right-hand side and would leave the left-hand side unchanged. Algorithmically, a violated inequality for a given assignment u : E → Z ≥0 can be found by a further transformation of the uncapacitated perfect b-matching problem to a transportation problem. That is, we construct a bipartite graph G ′′ such that each part of the bipartition consists of a copy of V ′ , say V ′′ := V ′ 1∪ V ′ 2 , and for each edge {v, w} ∈ E ′ we have the two edges {v 1 , w 2 } and {v 2 , w 1 } in E ′′ . By directing the edges from V 1 to V 2 and considering the nodes of V 1 as supplies and the nodes of V 2 as demands, the separation problem becomes a transshipment problem, which can be solved by a maximum flow computation. If and only if the value of the maximum flow equals b(V ′ ), then there exist a fractional perfect b-matching in G ′ . If the value of the maximum flow, or by duality the minimum cut, is smaller, then the nodes of G ′′ constituting a minimum cut also represent a vertex cover y :
Moreover, the set
which gives the equivalent violated inequality b(S) > b(N ′ (S)).
Further Improvements
In principle, we could start the separation with the zero-vector. However, it is much more efficient to provide a starting set of valid inequalities which is of moderate size and easy to solve but still yields a dual solution which is not too far from the optimum.
It is straightforward to select the inequalities arising from the sets X = {i} and Y = ∅ for each i ∈ V . This means, we simply bound the variables in the perfect matching constraints by the corresponding capacities and get
where r i := max{r ij : j = 1, . . . , m}.
For our application, it is easy to see that the optimal objective value of this partial solution is at least half of the optimum of the whole problem, since u(i, i) := r i and u(i, k) := 0 for all i = k is a feasible solution and
holds by summing up the inequalities (3) and the non-negativity constraints for u(i, i).
Note that these inequalities together with the non-negativity constraints determine the fractional r-edge cover polytope. This has the following two consequences for us: Firstly, there is a fully combinatorial algorithm to compute a minimum fractional b-edge cover. Secondly, the integer r-edge cover polyhedron is contained in the integer hull P I of P . Hence, we may also add the valid inequalities
to the description of P I . Recall that we made the temporary assumption that our display controller works with arbitrary precision. But in the real world, this is hardly possible since our digital circuit shall work with a fixed clock frequency. Hence, there is a minimum amount of time for which switches can be closed and opened again. Thus, we only have fixed precision, which is equivalent to require the variables to be integer by appropriate scaling. Nevertheless, the separation routine has not become useless since we already get a very simple approximation algorithm by simply rounding each fractional variable to the next greater integer. Note that then the obtained integer solution has an objective value within an additive error of |E|. Moreover, the separation routine can be used within the framework of [2] to come up with fully combinatorial heuristics.
It is natural to ask whether there is a completely different approach to solve the problem exactly in polynomial time. But there is little hope because of the NP-completeness result of Section 4. Before we come to this section, we give a brief overview of the experimental results with respect to the combinatorial separation.
Experimental Evaluation
We implemented the combinatorial separation using the LEDA 6.1 library to solve the transportation problem by the built-in MAXFLOW routine. We can use this separation to speed up the solution time of CPLEX. For example, the instance with 30 rows mentionend before is now solved in 14 seconds instead of 240 seconds. However, it is still not possible to solve the LP relaxation of a full QQVGA instance in timely manner.
Hardness Results
We show in this section that already the restriction to the black/white case, i.e. binary matrices R ∈ {0, 1} n×m , is NP-complete and also hard to approximate. To this end, we define the Matchable Subset Problem and analyze its complexity.
Definition 1 (Matchable Subset Problem).
Given an undirected graph G = (V, E) and m subsets of the nodes V 1 , . . . , V m ⊆ V , find an edgesetẼ ⊂ E of minimum cardinality such that for each j ∈ {1, . . . , m} the set V j is matchable inG = (V,Ẽ), i.e. there is a perfect matching in the subgraph ofG induced by V j . Theorem 2. The Matchable Subset Problem is NP-complete, even when restricted to complete graphs (with or without self-loops).
Proof. Clearly, the problem is in NP. We show hardness by a reduction from vertex cover. Given an undirected graph G = (V, E), we construct an undirected graph G ′ = (V ′ , E ′ ) as follows. Let Given an edgesetẼ ⊆ E ′ such that for every e = {u, v} ∈ E the set {s, u, v, t e } is matchable in the graph (V ′ ,Ẽ), we define the nodeset C := {u ∈ V : {s, u} ∈Ẽ}. By construction, we have that |C| = |Ẽ| − |E|. We show next that C is a vertex cover in G. Let e = {u, v} be an arbitrary edge. This implies that {s, u}, {s, v} ∩Ẽ = ∅, since {s, u, v, t e } has a perfect matching withinẼ. Hence, {u, v} ∩ C = ∅, which proves that C is a vertex cover in G.
Conversely, if C is a vertex cover in G, then we defineẼ ⊆ E ′ as follows. We distinguish the two cases if an edge of G is covered by one or two vertices in C. If e = {u, v} ∈ E is covered by exactly one vertex in C, say C ∪ e = {u}, we include the edges {s, u} and {v, t e } inẼ. If both of ends of an edge e = {u, v} are contained Fig. 3 . The construction for the hardness-proof is illustrated on the left and the one for the decomposition problem on the right.
in the cover C, then we include {s, u}, {s, v}, and an arbitrary edge of {u, t e } and {v, t e }. This yields |Ẽ| = |C| + |E| and moreover for each edge e = {u, v} ∈ E the set {s, u, v, t e } has a perfect matching withinẼ. Including also the edges {u, v} and {s, t e } in E ′ does not help as they are only contained in one induced subgraph. Moreover, it is easy to see that self-loops are not suited to improve the solution.
⊓ ⊔
The relation to DDA is as follows. Consider the complete graph and subsets V 1 , . . . , V m of its nodes. For each j = 1, . . . , m, let r j be the characteristic vector of V j . The shortest DDA timing for the matrix R made up by the column vectors r j is then equal to the minimum-cardinality edgeset solving the Matchable Subset Problem. Hence, the vertex cover problem for a graph G = (V, E) can be solved as follows. From the node-edgeincidence matrix A ∈ {0, 1} |V |×|E| , we construct the image R ∈ {0, 1} (1+|V |+|E|)×|E| as shown on the right of Figure 3 . The optimum number of timesteps to display the image using DDA is equal to the minimum size of a vertex cover of G plus the number of edges in G. Note that the constructed graph G ′ does not contain odd cycles and thus constraining the input to bipartite graphs is not a restriction. Furthermore, it does not make the problem easier if we also consider fractional perfect matchings. Note that vertex cover is hard to approximate within a factor α > 1.36 [11] . Moreover, hardness of approximation also holds for graphs with bounded degree [12] . Thus, we get the following theorem. Theorem 3. There is a constant β such that it is NP-hard to approximate the Matchable Subset Problem within a factor of β.
Proof. An approximation algorithm for the Matchable Subset Problem with guarantee β yields an approximation algorithm for vertex cover with
where ∆ denotes the maximum degree of G. Hence, if it is hard to approximate vertex cover on a graph with maximum degree ∆ within an approximation factor α(∆), then it is hard to approximate the matchable subset problem within an approximation factor β = α(∆) + ∆ ∆ + 1 ⊓ ⊔
We get β ≥ 261 260 = 1.00385 by the numbers from [12] and graphs with maximum degree 4. By the previous considerations, this also holds for the shortest DDA timing.
Cutting the Bandwidth
In the previous Section, we have seen that the problem is hard for complete graphs. Moreover, the experimental evaluations have shown that the large number of variables prevents ourselves from solving even the LP-relaxation in a timely manner. Hence, it is natural to dismiss some (or most) of them, i.e. to consider a suitable subgraph. From previous work [5] , we know that Consecutive Doubleline Addressing (CDA) works pretty well in practice. A close look reveals that CDA is a special case of DDA, when we consider a path with n nodes (and self-loops at every node) as the corresponding graph. It is easy to see that this graph has bandwidth 1. If we take the square of this graph, i.e. inserting edges that skip one node on the path, we get bandwidth 2. The third power yields bandwith 3, and so on. Note that the n-th power, i.e. bandwidth n, is again the complete graph. We use the same testset of images as in [2] and [5] in QQVGA resolution (i.e. n = 120 and m = 3 · 160 = 480) to compare the different bandwidths b = 1, 2, 3, 4. We do so with a small uncertainty by taking the mean of the LP-relaxation and the objective value after the naive rounding. This is justified since the error is negligible and the running time for solving the integer linear program is much higher, e.g. 27 seconds compared to 51.5 minutes. In fact, the error is so small that the error bars would not exceed the symbol size in Figure 4 . The reduction factor of the worst instance for CDA dropped from 63% via 58% and 54% to 52% for b = 2, 3, 4, respectively. As one can see in Figure 4 , there is a saturation
