Subject: reinforcement learning

Title: Deep reinforcement learning overview of the state of the art
Authors: Fenjiro, Y.
Benbrahim, H.
Subjects: reinforcement learning
deep learning
convolutional network
recurrent network
deep reinforcement learning
Publisher: Sieć Badawcza Łukasiewicz - Przemysłowy Instytut Automatyki i Pomiarów
Links: https://bibliotekanauki.pl/articles/384788.pdf
Description: Artificial intelligence made major advances with reinforcement learning (RL) in the last century and with the advent of deep learning (DL) in the 1990s, especially the breakthrough of convolutional networks in computer vision. The adoption of DL neural networks in RL in the first decade of the 21st century led to an end-to-end framework, called deep reinforcement learning (DRL), that enabled great advances in human-level agents and autonomous systems. In this paper, we go through the development timeline of RL and DL technologies and describe the main improvements made in both fields. We then give an overview of the state of the art of this new and promising field by surveying a set of algorithms (value optimization, policy optimization and actor-critic), outline current challenges and real-world applications, and describe the hardware and frameworks used. Finally, we discuss some potential research directions in deep RL, which we expect to lead toward a real human level of intelligence.
Content provider: Biblioteka Nauki

Article

Title: A compact DQN model for mobile agents with collision avoidance
Authors: Kamola, Mariusz
Subjects: Q-learning
DQN
reinforcement learning
Publisher: Sieć Badawcza Łukasiewicz - Przemysłowy Instytut Automatyki i Pomiarów
Links: https://bibliotekanauki.pl/articles/27314243.pdf
Description: This paper presents a complete simulation and reinforcement learning solution for training mobile agents to track a route and avoid mutual collisions. The aim was to achieve this functionality with limited resources, with respect to both the model input and the model size. The designed models are shown to keep agents safely on the track. The collision-avoidance skills that the agents develop during training are primitive but rational. The small size of the model allows fast training with limited computational resources.
Content provider: Biblioteka Nauki

Article

Title: Prioritized epoch-incremental Q-learning algorithm
Authors: Zajdel, R.
Subjects: reinforcement learning
Q-learning
grid world
Publisher: Polska Akademia Nauk. Czytelnia Czasopism PAN
Links: https://bibliotekanauki.pl/articles/375619.pdf
Description: The basic reinforcement learning algorithms, such as Q-learning and Sarsa, have a fast single learning step, but the number of epochs needed to reach the optimal policy is unacceptably large. Many methods reduce the number of necessary epochs, such as TD(λ > 0), Dyna and prioritized sweeping, but their computational time is considerable. This paper proposes combining the Q-learning algorithm, performed in incremental mode, with an acceleration method executed in epoch mode. The acceleration is based on the distance to the terminal state. This approach keeps the single learning step short while achieving efficiency comparable to Dyna or prioritized sweeping. The proposed algorithm is compared with Q(λ)-learning, Dyna-Q and prioritized sweeping in experiments on three grid worlds. The learning time and the number of epochs needed to reach the terminal state are used to evaluate the efficiency of the compared algorithms.
The efficiency of the basic reinforcement learning algorithms Q-learning and Sarsa, measured by the number of trials needed to obtain the optimal policy, is relatively low, so their practical applicability is limited. Their advantage, however, is low computational complexity: a single learning step executes quickly enough that they work very well in online control systems. Methods for accelerating reinforcement learning, which reach the absorbing state after far fewer trials than the basic algorithms, usually increase computational complexity and lengthen the execution time of a single learning step. The most commonly used acceleration, the temporal-difference method TD(λ > 0), requires additional memory elements in the form of eligibility traces. A single learning step in such an algorithm takes considerably longer because, unlike the basic algorithm, which updates the action-value function only for the currently active state, it performs the update for all states. More efficient acceleration methods, such as Dyna and prioritized sweeping, also belong to the class of memory-based algorithms; their main idea is reinforcement learning based on an adaptive model of the environment. These methods reach the absorbing state in far fewer trials, but because of the increased computational complexity, the execution time of a single learning step becomes a significant factor limiting their use in systems with a large number of states.
The essence of these algorithms is to perform a fixed number of updates of the action-value function for states that were active in the past; in Dyna these states are chosen at random, whereas in prioritized sweeping they are ordered by the magnitude of the update error. This paper proposes an epoch-incremental reinforcement learning algorithm whose main idea is to combine the basic incremental Q-learning algorithm with an acceleration algorithm executed in epoch mode. The proposed epoch-mode learning relies mainly on the actual value of the reinforcement signal observed on the transition to the absorbing state, which is then propagated backward exponentially as a function of the estimated distance from the absorbing state. This approach yields a short single-step learning time in incremental mode (Tab. 2) while preserving the efficiency typical of Dyna and prioritized sweeping (Tab. 1 and Fig. 5).
Content provider: Biblioteka Nauki

Article
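
The record above describes an incremental Q-learning step combined with an epoch-mode pass that propagates the terminal reward backward according to the distance from the terminal state. The sketch below is only one possible reading of that description, not the algorithm from the paper: the function names, the breadth-first distance computation, the `gamma ** (d - 1)` target and the rule of only raising Q-values (which assumes a positive terminal reward) are all assumptions made for illustration.

```python
from collections import deque


def q_learning_step(Q, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.9, terminal=False):
    """Standard incremental Q-learning update for one observed transition."""
    # A terminal successor contributes no future value.
    best_next = 0.0 if terminal else max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)


def epoch_update(Q, transitions, terminal_state, terminal_reward, gamma=0.9):
    """Hypothetical epoch-mode pass over transitions remembered from the epoch.

    transitions is an iterable of (state, action, next_state) triples.
    Returns the estimated distance (in steps) of each state to the terminal state.
    """
    # Map each state to the (state, action) pairs observed to lead into it.
    predecessors = {}
    for s, a, s_next in transitions:
        predecessors.setdefault(s_next, []).append((s, a))

    # Breadth-first search backward from the terminal state.
    dist = {terminal_state: 0}
    queue = deque([terminal_state])
    while queue:
        state = queue.popleft()
        for s, a in predecessors.get(state, []):
            d = dist[state] + 1  # steps from taking a in s to the terminal state
            # Assumed target: terminal reward decayed exponentially with distance.
            target = terminal_reward * gamma ** (d - 1)
            if target > Q.get((s, a), 0.0):
                Q[(s, a)] = target
            if s not in dist:
                dist[s] = d
                queue.append(s)
    return dist
```

In this sketch the cheap `q_learning_step` would run after every transition, while `epoch_update` would run once per epoch, after the terminal state has been reached.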

Title: Accidental exploration through value predictors
Authors: Leśniak, Damian
Kisielewski, Tomasz
Description: Infinite length of trajectories is an almost universal assumption in the theoretical foundations of reinforcement learning. In practice, learning occurs on finite trajectories. In this paper we examine a specific result of this disparity, namely a strong bias of the time-bounded every-visit Monte Carlo value estimator. This manifests as vastly different learning dynamics for algorithms that use value predictors, including encouraging or discouraging exploration. We investigate these claims theoretically for a one-dimensional random walk, and empirically on a number of simple environments. We use GAE as an algorithm involving a value predictor and evolution strategies as a reference point.
Content provider: Repozytorium Uniwersytetu Jagiellońskiego

Article
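
The bias of a time-bounded every-visit Monte Carlo estimator mentioned in the record above can be illustrated with a toy case. This is not the paper's analysis (which concerns a one-dimensional random walk); the single-state process, the constant reward of 1 per step, the discount factor and the horizon below are assumptions chosen only to make the effect visible.

```python
def truncated_returns(rewards, gamma):
    """Discounted return from every time step of one finite trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def every_visit_estimate(rewards, gamma):
    """Every-visit Monte Carlo value estimate for a single-state process."""
    returns = truncated_returns(rewards, gamma)
    return sum(returns) / len(returns)


gamma, horizon = 0.9, 20
# Infinite-horizon value of receiving reward 1 forever.
true_value = 1.0 / (1.0 - gamma)
# Estimate from one trajectory cut off after `horizon` steps.
estimate = every_visit_estimate([1.0] * horizon, gamma)
print(f"true value: {true_value:.3f}, time-bounded estimate: {estimate:.3f}")
```

Visits late in the trajectory see only a few remaining rewards, so their returns pull the average below the infinite-horizon value; with negative rewards the same effect would push the estimate upward.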

Title: Epokowo-inkrementacyjny algorytm uczenia się ze wzmocnieniem wykorzystujący kryterium średniego wzmocnienia
The epoch-incremental reinforcement learning algorithm based on the average reward
Authors: Zajdel, R.
Subjects: reinforcement learning
R-learning
epoch-incremental algorithm
average reward reinforcement learning
epoch-incremental reinforcement learning
Publisher: Stowarzyszenie Inżynierów i Techników Mechaników Polskich
Links: https://bibliotekanauki.pl/articles/152882.pdf
Description: This paper proposes a new epoch-incremental reinforcement learning algorithm. Its main idea is to perform, in epoch mode, additional policy updates based on the distances of past active states from the terminal state. The proposed algorithm and the R(0)-learning, R(λ)-learning, Dyna-R and prioritized sweeping-R algorithms were applied to control a mountain car model and a model of a ball placed on a balancing beam.
This paper describes the application of average-reward reinforcement learning algorithms to control. It also proposes a new epoch-incremental reinforcement learning algorithm (EIR(0)-learning for short). In this algorithm, the basic R(0)-learning algorithm is implemented in the incremental mode and an environment model is built. In the epoch mode, the distances of past active states to the terminal state are determined from the model and then used to update the policy. The proposed algorithm was applied to mountain car (Fig. 4) and ball-beam (Fig. 5) models. EIR(0)-learning was empirically compared to R(0)-learning [4, 6], R(λ)-learning and the model-based algorithms Dyna-R and prioritized sweeping-R [11]. For the ball-beam system, EIR(0)-learning reached a stable control strategy after the smallest number of trials (Tab. 1, column 2). For the mountain car system, the number of trials was smaller than for R(0)-learning and R(λ)-learning, but greater than for Dyna-R and prioritized sweeping-R. It is worth noting that the execution times of Dyna-R and prioritized sweeping-R in the incremental mode were respectively 5 and 50 times longer than for the proposed EIR(0)-learning algorithm (Tab. 2, column 3). The main conclusion of this work is that the epoch-incremental learning algorithm provided a stable control strategy in a relatively small number of trials and with a short single-iteration time.
Content provider: Biblioteka Nauki

Article

Title: Deep-Hedge MCCFR: An algorithm for solving imperfect information games.
Deep-Hedge MCCFR: Algorytm rozwiązywania gier z niepełną informacją
Authors: Ziemiński, Marcin
Description: Recently, significant progress has been made in algorithms that learn to solve difficult decision problems. The dominant solutions based on deep reinforcement learning (Deep RL) have proved extremely effective, defeating the best human players in games such as Go, StarCraft II and Atari titles. These algorithms adapt their behaviour based on accumulated experience and received signals. On the other hand, these methods are not well suited to multi-agent environments, which are inherently non-stationary. Agents operating in such a setting shape their strategies in parallel, contributing to the changing character of the whole environment. Moreover, they act under imperfect information, since they have no insight into knowledge available only to their opponents. All this makes the optimization problem harder but also amenable to game-theoretic analysis. Game theory underlies the dominant algorithms for imperfect-information games. A program called Pluribus, based on the Counterfactual Regret Minimization (CFR) algorithm, defeated leading professional players in six-player No-Limit Texas Hold'em. CFR is an adaptive algorithm that aggregates information for each game state in tabular form, which is used to improve the strategy over successive iterations. However, applying this algorithm to large games requires extensive expert knowledge and a search of the game tree at run time. In this work we try to combine the two paradigms above. We propose a CFR-based solution that draws on developments in deep learning. Our algorithm, which we call Deep-Hedge MCCFR, models the strategies of individual agents using neural networks. These strategies are improved through experience gained during simulated play.
We show that the proposed algorithm achieves good results in practice for various games without requiring expert knowledge.
There has been much progress in training agents to tackle difficult sequential decision-making problems. The dominant solutions from the realm of deep reinforcement learning (Deep RL) have proved extremely successful, surpassing the top human players at Atari games and Go. These algorithms adapt their behaviour directly from their experience and received rewards. On the downside, Deep RL methods are not well suited to multi-agent environments, which are non-stationary by nature, as the probability distribution over possible outcomes may change over time. The agents within the same environment learn their strategies in parallel and modify their behaviour. What is more, they may have access to private information concealed from the others. This makes the optimization problem more difficult, but at the same time amenable to game-theoretic analysis. Game theory is at the centre of the prevailing approaches for solving imperfect-information competitive games. The Counterfactual Regret Minimization (CFR) algorithm was the basis for the poker AI called Pluribus, which defeated the best human professionals in six-player no-limit Texas hold'em. CFR is an adaptive tabular algorithm, which means that it aggregates information for each state of the game individually and uses it to iteratively adapt the strategy. However, to be feasible for large games, it requires a significant amount of domain-specific knowledge and a look-ahead search during execution. In this work we try to combine the two aforementioned paradigms. We propose an algorithm founded on Counterfactual Regret Minimization that utilizes the advancements of deep learning. Our method, called Deep-Hedge MCCFR, uses neural networks to model the agents' strategies. The strategies are improved through the experience gathered during self-play. We show that the algorithm does not require domain expertise and is applicable to various scenarios in unmodified form.
Content provider: Repozytorium Uniwersytetu Jagiellońskiego

Other

Title: Discrete uncertainty quantification for offline reinforcement learning
Authors: Pérez Torres, Jose Luis
Corrochano Jiménez, Javier
García, Javier
Majadas, Rubén
Ibañez-Llano, Cristina
Pérez, Sergio
Fernández, Fernando
Subjects: off-line reinforcement learning
uncertainty quantification
machine learning
Publisher: Społeczna Akademia Nauk w Łodzi. Polskie Towarzystwo Sieci Neuronowych
Links: https://bibliotekanauki.pl/articles/23944835.pdf
Description: In many reinforcement learning (RL) tasks, the classical online interaction of the learning agent with the environment is impractical because such interaction is either expensive or dangerous. In these cases, previously gathered data can be used, giving rise to what is typically called offline RL. However, this type of learning faces many challenges, mostly because the exploration/exploitation trade-off is overshadowed. In addition, the historical data is usually biased by the way it was obtained, typically by a sub-optimal controller, producing a distributional shift between the historical data and the data required to learn the optimal policy. In this paper, we present a novel approach to deal with the uncertainty arising from the absence or sparse presence of some state-action pairs in the learning data. Our approach is based on shaping the reward perceived from the environment to ensure the task is solved. We present the approach and show that combining it with classic online RL methods makes them perform as well as state-of-the-art offline RL algorithms such as CQL and BCQ. Finally, we show that using our method on top of established offline learning algorithms can improve them.
Content provider: Biblioteka Nauki

Article

Title: Use of Modified Adaptive Heuristic Critic Algorithm for Novel Scheduling Mechanism in Packet-Switched Networks
Authors: Jednoralski, M.
Kacprzak, T.
Subjects: reinforcement learning
telecommunication networks
packet scheduling
Publisher: Uniwersytet Przyrodniczo-Humanistyczny w Siedlcach
Links: https://bibliotekanauki.pl/articles/92909.pdf
Description: This paper introduces a novel scheduling algorithm, based on Reinforcement Learning and a modified Adaptive Heuristic Critic, for selecting packets in a switch node for transmission over a network channel. A comparison of two well-known scheduling algorithms, Earliest Deadline First and Round Robin, shows that they perform well in some cases but cannot adapt their behavior to traffic changes. Simulation studies show that the novel scheduling algorithm outperforms Round Robin and Earliest Deadline First by adapting to changing network conditions.
Content provider: Biblioteka Nauki

Article

Title: Adaptive controller design for electric drive with variable parameters by Reinforcement Learning method
Authors: Pajchrowski, T.
Siwek, P.
Wójcik, A.
Subjects: Reinforcement Learning
adaptive control
electric drive
machine learning
Publisher: Polska Akademia Nauk. Czytelnia Czasopism PAN
Links: https://bibliotekanauki.pl/articles/201068.pdf
Description: The paper presents a method for designing a neural speed controller using the Reinforcement Learning method. The controlled object is an electric drive with a permanent-magnet synchronous motor, having a complex mechanical structure and changeable parameters. Several research cases of the control system with a neural controller are presented, focusing on changes in the object parameters. The influence of the critic's behaviour is also studied, where the critic is a function of control error and energy cost. It ensures long-term performance stability without the need to switch off the adaptation algorithm. Numerous simulation tests were carried out and confirmed on a real test stand.
Content provider: Biblioteka Nauki

Article

Title: Multi agent deep learning with cooperative communication
Authors: Simões, David
Lau, Nuno
Reis, Luís Paulo
Subjects: multi-agent systems
deep reinforcement learning
centralized learning
Publisher: Społeczna Akademia Nauk w Łodzi. Polskie Towarzystwo Sieci Neuronowych
Links: https://bibliotekanauki.pl/articles/1837537.pdf
Description: We consider the problem of multiple agents cooperating in a partially-observable environment. Agents must learn to coordinate and share relevant information to solve the tasks successfully. This article describes Asynchronous Advantage Actor-Critic with Communication (A3C2), an end-to-end differentiable approach in which agents learn policies and communication protocols simultaneously. A3C2 uses a centralized learning, distributed execution paradigm and supports independent agents, dynamic team sizes, partially-observable environments, and noisy communications. We compare A3C2 with other state-of-the-art proposals and show that it outperforms them in multiple environments.
Content provider: Biblioteka Nauki

Article

Information

You are searching for the phrase "reinforcement learning" by criterion: Subject