Model-Free Alternative

Over half of the world’s population lives in urban areas, a trend which is expected only to grow as we move further into the future. This increasing urbanisation brings challenges in the management of urban infrastructure systems.

As an essential infrastructure of any city, the energy system presents itself as one of the biggest challenges. Indeed, as cities expand in population and economically, global energy consumption increases, and as a result, so do greenhouse gas (GHG) emissions. Key to realising the goals laid out by the 2030 Agenda for Sustainable Development is the energy transition – embodied in the goals pertaining to affordable and clean energy, sustainable cities and communities, and climate action. Renewable energy systems (RESs) and energy efficiency have been shown to be key strategies towards achieving these goals.

While the building sector is considered to be one of the biggest contributors to climate change, it is also seen as an area with many opportunities for realising the energy transition. Indeed, the emergence of the smart city and the internet of things (IoT), alongside photovoltaic and battery technology, offers opportunities both for the smart management of buildings and for forming self-sufficient peer-to-peer (P2P) electricity trading communities. Within this context, advanced building control offers significant potential for mitigating global warming, grid instability, soaring energy costs, and exposure to poor indoor building climates.

Most advanced control strategies, however, rely on complex mathematical models, which require a great deal of expertise to construct, are therefore costly in time and money, and are unlikely to be frequently updated – which can lead to suboptimal or even incorrect performance. Furthermore, in economic settings as complex and dynamic as the P2P electricity markets referred to above, finding solutions is often computationally intractable.

A model-based approach thus seems, as alluded to above, unsustainable, and Ross May therefore proposes a model-free alternative instead.

One such alternative is the reinforcement learning (RL) method. This method provides a beautiful solution that addresses many of the limitations seen in more classical approaches – those based on complex mathematical models – to single- and multi-agent systems.

> To address the feasibility of RL in the context of building systems, I have developed four papers.

A study of the literature found that, while there is much review work in support of RL for controlling energy consumption, there were no such works analysing RL from a methodological perspective w.r.t. controlling the comfort level of building occupants.

> Thus, in Paper I, to fill in this gap in knowledge, a comprehensive review in this area was carried out.

To follow up, in Paper II, a case study was conducted to further assess, among other things, the computational feasibility of RL for controlling occupant comfort in a single-agent context. It was found that the RL method was able to improve thermal comfort and indoor air quality by more than 90% when compared with historically observed occupant data.

Broadening the scope of RL, Papers III and IV considered the feasibility of RL at the district scale by considering the efficient trade of renewable electricity in a peer-to-peer prosumer energy market.

In particular, in Paper III, by extending an open source economic simulation framework, multi-agent reinforcement learning (MARL) was used to optimise a dynamic price policy for trading the locally produced electricity. Compared with a benchmark fixed price signal, the dynamic price mechanism arrived at by RL increased community net profit by more than 28%, and median community self-sufficiency by more than 2%. Furthermore, emergent socio-economic behaviours, such as changes in supply w.r.t. changes in price, were identified.

A limitation of Paper III, however, is that it was conducted in a single environment. To address this limitation and to assess the general validity of the proposed MARL-solution, Paper IV conducted a full factorial experiment based on the factors of climate (manifested in heterogeneous demand/supply profiles and associated battery parameters), community scale, and price mechanism, in order to ascertain the response of the community w.r.t. net-loss (financial gain), self-sufficiency, and income equality from trading locally produced electricity.

The central finding of Paper IV was that the community, w.r.t. net-loss, performs significantly better under a learned dynamic price mechanism than under the benchmark fixed price mechanism, and furthermore, that a community under such a dynamic price mechanism has odds of 2 to 1 of achieving increased financial savings.

Keywords: Reinforcement Learning, Multi-Agent Reinforcement Learning, Buildings, Indoor Climate, Occupant Comfort, Positive Energy Districts, Peer-to-Peer Markets, Complex Adaptive Systems

Ross May, School of Information and Engineering, Microdata Analysis

~

MAY, Ross, 2023. On the Feasibility of Reinforcement Learning in Single- and Multi-Agent Systems: The Cases of Indoor Climate and Prosumer Electricity Trading Communities. Dalarna University.

ISBN 978-91-88679-40-6 urn:nbn:se:du-45300 (http://urn.kb.se/resolve?urn=urn:nbn:se:du-45300)

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, Yuan Jin, and Liguo Xu. "A review of reinforcement learning methodologies for controlling occupant comfort in buildings", Sustainable Cities and Society, 51:101748, July 2019.

II Mengjie Han, Ross May, Xingxing Zhang, Xinru Wang, Song Pan, Da Yan, and Yuan Jin. "A novel reinforcement learning method for improving occupant comfort via window opening and closing", Sustainable Cities and Society, 61:102247, May 2020.

III Ross May, Pei Huang. "A multi-agent reinforcement learning approach for investigating and optimising peer-to-peer prosumer energy markets", Applied Energy, 334:120705, January 2023.

IV Ross May, Kenneth Carling, Pei Huang. "Does a smart agent overcome the tragedy of the commons in residential prosumer communities?", (Under submission to Energy Policy), January 2023.

Reprints were made with permission from the publishers.

[…] For the reader who wants a succinct summary of the main purpose of the thesis and the contributions herein to the scientific literature, I urge you to go straight to […] the concluding summary to this thesis.

4.1 Concluding Discussion

Wrapping everything up, I’d like to make some concluding remarks about what we have seen so far and what we can expect to see. We began this overview by first considering Agenda 2030 and its SDGs, and in particular, SDG 7 – the goal on energy, the most pressing goal w.r.t. climate mitigation and for achieving many of the other SDGs [26, 54]. We then saw two core strategies for achieving the environmental aspect of the so-called energy transition, namely, increasing the share of renewable energy and increasing energy efficiency [26]. Focusing on these w.r.t. the biggest contributors to climate change, namely, buildings, our attention was drawn to the notion of building control as a key enabler to realising the goals embodied in these strategies, and in particular, how the co-evolution of the smart city and the smart grid offers the opportunity for the smart management of buildings that facilitates not only the environmental aspect of the energy transition, but also its social aspect embodied in Target 1 of Table 1.1.

Indeed, in seeking energy efficiency we must also consider seriously the inhabitants of these buildings, for otherwise we risk creating an environment which is simply uninhabitable, consequently resulting in ill health, low morale, losses in work efficiency [38], et cetera, and thus ultimately leading to an unsustainable building environment. And so, the comfort of occupants plays a critical role in the drive towards energy efficient buildings, and "intelligent" controls that are able to find the right balance between occupant comfort and energy efficiency are required. Similarly, missed opportunities to share energy among peers in a prosumer community network not only increase reliance on the parent grid, thereby reducing self-sufficiency (SS), but also increase the economic cost to the community [40, 61]. Thus, extending intelligent controls to the management of distributed renewable energy resources, so as to increase SS while decreasing net-cost in the community, is also seen as critical to achieving the energy transition.

Given the complexity of such an undertaking, attacking it with purely model-based solutions seems, as has been argued, an unsustainable approach. This observation leads one to consider model-free solutions as an alternative path, and in particular, RL, a method which closely resembles the way we humans learn, with ever increasing support for RL as one of the functions of the human brain [13]. My belief is that, with the help of RL, we can achieve the kind of occupant- and consumer-centric control alluded to above, which will ultimately take us a step closer to achieving sustainable cities and communities, and in particular, sustainable energy. To this end, the aim of this thesis has been to understand the feasibility of model-free RL in single- and multi-agent energy systems, and in so doing it has considered two cases, namely, the case of indoor climate (a single-agent context) and the case of prosumer electricity trading communities (a multi-agent context).

In the case of indoor climate, Papers I and II considered the feasibility of model-free RL as a control method for occupant comfort (OC). Indeed, in the literature, there is already evidence of its feasibility from the energy dimension, and one extensive review highlighting this is the work by Vázquez-Canteli and Nagy [68]. From the comfort dimension, however, a comprehensive review addressing the feasibility of RL was found to be lacking. Paper I addressed this gap in knowledge, and the general finding was overall positive in terms of the feasibility of model-free RL as a control method in this domain. Indeed, of the limited works in this area, it was found that the model-free RL technique achieved results at least as good as, and in many cases better than, the other methods considered in the studies, both in terms of OC and energy efficiency. Interestingly, there was one instance, under the assumption of a perfect model of environmental dynamics, in which a model-based RL control outperformed its model-free counterpart. In the same study, however, model-free RL was superior under an imperfect model of the dynamics [46]. Another finding of Paper I is the clear lack of works exploring model-free RL for OC control in buildings, for which there are many important problems to explore. Indeed, one such problem, in the context of a naturally ventilated building, is building ventilation via the control of window systems.

Paper II addressed this problem by considering a case study using secondary data collected from an office building in Beijing during the transition season (March 16 - May 15, 2015). The experimental room consisted of a single door and a south-facing push-pull window, with the same single occupant following the university working routine during the data collection period. Assuming a discrete state and action space, two tabular model-free RL methods, namely, Q-learning and Sarsa, were used as our testing algorithms for adaptive window control, and a long short-term memory (LSTM) recurrent neural network was implemented as a model for simulating the change in indoor temperature w.r.t. an action in the environment. Using the occupant data as a natural benchmark for evaluating the agents’ performance, we found an improvement in thermal comfort and indoor air quality of more than 90% when compared with the occupant’s behavioural data.
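To make the distinction between the two tabular methods named above concrete, the following is a minimal sketch of the standard Q-learning and Sarsa update rules. It is not the implementation used in Paper II: the state labels, the two window actions, the reward value, and the step-size and discount parameters are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the best action in the next state."""
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])


# Hypothetical discrete window actions and state labels (assumptions).
ACTIONS = ("open", "close")

q_table = defaultdict(float)  # unseen state-action pairs default to 0.0
q_learning_update(q_table, s="warm", a="open", r=1.0,
                  s_next="comfortable", actions=ACTIONS)
```

The only difference between the two updates is the bootstrap target: Q-learning uses the maximum over next actions, whereas Sarsa uses the value of the action the behaviour policy actually selects, for example via `epsilon_greedy`.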

Papers I and II provide strong evidence for the feasibility of model-free RL as a solution method in the domain of building OC. That said, we are still at the early stages of this journey and there remains much more work to be done. For example, the integration of computation and communications infrastructures with BMSs seems to be quite fuzzy, lacking in many cases implementation details. Work on the feasibility of cooperative multi-agent RL (MARL) for controlling the indoor environment has been limited. There have been no studies examining the performance of cooperative MARL for multi-occupant/multi-zonal settings. From a practical point of view, solving this ought to enable more efficient building operation while respecting OC. There are still many open questions regarding the best way to incorporate occupancy patterns, as well as human feedback, into the control logic, which, as already alluded to earlier in this thesis, plays a crucial role in true occupant-centric control. The application of deep RL in the domain of OC control has been limited and there remain many avenues to be explored here. The coupling of comfort components such as indoor air quality (IAQ) and visual comfort within BCSs has had far less attention than thermal comfort, but the comfort level of an occupant should be viewed holistically, and where possible all comfort components should be considered.

In the case of prosumer electricity trading communities we broaden the scope of intelligent control to multiple buildings, in this case multiple household buildings, and consider the pressing problem of renewable energy trading among fellow peers in a community-based P2P prosumer energy market. Indeed, solving this problem not only aids in increasing the share of renewable energy in the global energy mix, but also takes us a step closer to achieving the social aspect given in Table 1.1, namely, energy security. Papers III and IV take up this task, and consider the feasibility of cooperative model-free MARL as a solution method for approaching the problem of incentivising peers in such a market to behave in a way that is not only conducive to grid decarbonisation, but also cost saving to consumers and profitable to prosumers.

Existing literature approaching this problem, while important in its own right and advancing the field in this area, has, in my opinion, disregarded in one way or another the complex adaptive nature of such a system. To address many of the limitations in the existing literature around this problem, Paper III proposed a MARL-solution inspired by the work of Zheng et al. [74]. By extending their economic simulation framework, Foundation, and based on the observation that an optimal dynamic price mechanism should help to balance the trade-off between price and willingness to share, we sought to learn a "price planner" (analogous to a "social planner") based on data-driven simulations using data from a virtual residential building community located in Sweden. As the most natural benchmark against which to compare the performance of the community under such a price planner, we also simulated the community under a fixed price mechanism, set at the midpoint between the grid feed-in price and the grid consumer price. As social performance metrics, we chose total net-loss (the cost of buying energy minus the income gained from selling it), SS, and income equality. The general finding of Paper III answered in the affirmative the underlying hypothesis, namely, that a dynamic price planner is able to implicitly coordinate the community in a way that leads to a better social outcome. In particular, the community under the dynamic price planner increased net-profit by more than 28%, and increased median SS, which was already considered high in the fixed price scenario, by more than 2%. As a proof-of-concept, Paper III has identified the proposed MARL-solution as a feasible approach for studying such P2P market structures and has highlighted the existence of a social outcome that aligns with the environmental and social goals embodied in the targets of Table 1.1.
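The benchmark price and the net-loss metric described above reduce to simple arithmetic, sketched below under stated assumptions. The function names, the per-time-step list representation, and all numeric values are hypothetical; they are not taken from Paper III or from the Foundation framework.

```python
def fixed_midpoint_price(feed_in_price, consumer_price):
    """Benchmark fixed price: midpoint of grid feed-in and grid consumer price."""
    return (feed_in_price + consumer_price) / 2


def net_loss(bought_kwh, buy_prices, sold_kwh, sell_prices):
    """Cost of buying energy minus income from selling it.

    Each argument is a sequence with one entry per time step.
    A negative result corresponds to a net profit.
    """
    cost = sum(q * p for q, p in zip(bought_kwh, buy_prices))
    income = sum(q * p for q, p in zip(sold_kwh, sell_prices))
    return cost - income


# Hypothetical prices (currency units per kWh) and traded volumes (kWh).
price = fixed_midpoint_price(feed_in_price=0.5, consumer_price=1.5)
loss = net_loss(
    bought_kwh=[2.0, 1.0], buy_prices=[price, price],
    sold_kwh=[1.0, 0.0], sell_prices=[price, price],
)
```

Under a dynamic price planner the `buy_prices` and `sell_prices` sequences would vary per time step instead of repeating one fixed value, which is the only change this sketch would need.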

A limitation of Paper III, however, is that it was conducted in a single environment. Indeed, existing studies have shown that different climates and locations, as well as differing community scales, impact the energy sharing performance of communities [31]. Hence, Paper IV sought to assess the general validity of the proposed MARL-solution by conducting a full factorial experiment on the factors of climate (manifested in heterogeneous demand/supply profiles and associated battery parameters), community scale, and price mechanism (the benchmark fixed price mechanism versus a dynamic price planner). Real demand data from three distinct countries, namely, Canada, Germany, and Sweden, were used to drive the experimental simulations. Using the same performance metrics as in Paper III to compare the performances of the different treatments, the general finding of Paper IV was that the community under a learned price planner performed overall significantly better than under a fixed price mechanism, and, moreover, that, in general, a community has odds of 2 to 1 of achieving a lower net-loss under the learned dynamic price mechanism compared to the benchmark fixed price mechanism. Furthermore, two additional hypotheses were tested, namely, that the difference in performance of the community under a fixed price versus a dynamic price planner will not be affected by climate, and that the difference in performance will be more pronounced in larger-scale communities. In the former case we found, contrary to the hypothesis, that differences in performance are affected by climate, and in the latter that, while scale significantly impacts the performance of communities, the difference is in fact more pronounced in smaller-scale communities. We also observed a general pattern in learning, both in the fixed price and dynamic price scenarios, that is both very slow and volatile.
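As an illustration of what a full factorial design over these three factors looks like, and of how odds of 2 to 1 translate into a probability, consider the sketch below. The three climate levels and the two price mechanisms come from the text; the community-scale levels are hypothetical placeholders, since the actual levels are not stated here, and reading the odds as a probability of a lower net-loss is the standard conversion rather than a restatement of Paper IV's statistical model.

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


def odds_to_probability(in_favour, against):
    """Convert odds of `in_favour` to `against` into a probability."""
    return in_favour / (in_favour + against)


FACTORS = {
    "climate": ["Canada", "Germany", "Sweden"],
    "scale": ["small", "large"],  # hypothetical levels (assumption)
    "price_mechanism": ["fixed", "dynamic"],
}

treatments = full_factorial(FACTORS)
p_lower_net_loss = odds_to_probability(2, 1)  # odds of 2 to 1
```

With these placeholder levels the design has 3 x 2 x 2 = 12 treatments, and odds of 2 to 1 correspond to a probability of two thirds.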

Papers III and IV offer strong evidence in support of model-free cooperative MARL as a feasible approach to studying community-based P2P prosumer networks and, furthermore, show the existence of a better social outcome under a dynamic price planner. We also identify the community behaving unexpectedly under the price planner; that is to say, we observe that climate affects the differences in performance between the two interventional mechanisms, and that these differences are more pronounced in smaller-scale communities.

Why is the price planner not behaving as we expected? Given the slow and volatile learning that was observed, one conjecture is that this might have something to do with the fact that no memory (recurrent state) component was included in the neural-network architectures of the policy networks. Such a component could improve the performance of both the community and the price planner, but at the additional cost of compute.

Given the novelty of this work, there remain avenues for future research. One example is to investigate the learning outcomes under the addition of an LSTM component as part of the policy networks. Will we see smoother learning in this case, and does this result in an outcome that aligns with the additional hypotheses alluded to earlier? Another interesting avenue for future research would be to consider two or more price planners. Indeed, one can imagine assigning each to different clusters of the community, clusters that preserve the same heterogeneity with regards to energy sharing performance. By doing this, a bigger portion of the planner policy space can be explored, which could help the community’s learning performance to stabilise, that is, to be less volatile, and, furthermore, might also lead to a better financial outcome overall for the community.

The open issues listed above, including those in the case of indoor climate, are not an exhaustive list and there are, I’m sure, many other open issues. But in terms of this thesis, and the knowledge I have up to this point in time, these are the ones which strike me as some of the most pressing. It is my belief that solving these will take us a step further in achieving Agenda 2030.