Mathematical Foundations of Reinforcement Learning


Preface: Advice on Learning Reinforcement Learning

  • Don't expect to master it quickly; there is no shortcut.
  • Allocate a reasonable amount of time to each of your goals.

Introduction

Classic textbooks

image-20230113121914133

Course objectives

image-20230113121941318

image-20230113122213354

image-20230113122440140

The Bellman Equation

image-20230128204812340

image-20230128165517590

image-20230128170409931

image-20230128172712551

image-20230128175858154

image-20230128180228305

image-20230128180554256

image-20230128180738504

image-20230128180756111

image-20230128181552387

image-20230128182026388

image-20230128182318912

image-20230128182354645

image-20230128182456991

image-20230128183009226

image-20230128183604826

image-20230128183710147

Evaluating how good a policy is by computing its state values

image-20230128184124631

image-20230128184148158

Action value function

As the formulas below show, state values and action values can be derived from each other.
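
For reference, the two standard relations, in the usual MDP notation where $p(r|s,a)$ and $p(s'|s,a)$ are the reward and transition probabilities:

$$v_\pi(s)=\sum_a \pi(a|s)\,q_\pi(s,a),\qquad q_\pi(s,a)=\sum_r p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v_\pi(s')$$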

image-20230128205247340

image-20230128205613554

image-20230128210313033

image-20230128210403215

image-20230128210541462

image-20230128210734706

The Bellman Optimality Equation

image-20230130103748502

image-20230129223633446

image-20230129230417648

We want to solve this optimization problem to obtain the optimal policy $\pi$.
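
For reference, the Bellman optimality equation in elementwise form is

$$v(s)=\max_{\pi}\sum_a \pi(a|s)\left(\sum_r p(r|s,a)\,r+\gamma\sum_{s'}p(s'|s,a)\,v(s')\right),\quad\forall s,$$

where the inner maximization is attained by a greedy deterministic policy that puts all probability on the action with the largest value of the bracketed term $q(s,a)$.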

image-20230129230658694

image-20230129230817266

image-20230129230936242

image-20230129231630304

image-20230129231938847

image-20230129232346610

image-20230130103703750

image-20230130104100582

image-20230130104232048

Solving the equation via the contraction mapping theorem

$$x=f(x)$$
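
Here $f$ is the right-hand side of the Bellman optimality equation viewed as a function of $v$. The contraction mapping theorem guarantees that if $f$ is a contraction, i.e., $\|f(x_1)-f(x_2)\|\le\gamma\|x_1-x_2\|$ for some $\gamma\in(0,1)$, then $x=f(x)$ has a unique solution $x^*$, and the iteration $x_{k+1}=f(x_k)$ converges to $x^*$ exponentially fast from any starting point $x_0$.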

image-20230130104410641

image-20230130104744794

image-20230130104825732

image-20230130105052759

image-20230130105556156

image-20230130105714515

image-20230130105837611

In the example below, although the policy has already become optimal, the state values have not yet converged to their optimal values, so the iteration must continue.

image-20230130110306039

The iteration cannot go on forever, so a stopping criterion is needed; a common one is $\|v_k - v_{k+1}\| < \epsilon$.
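
As a concrete illustration, here is a minimal C++ sketch of this value iteration with exactly that stopping criterion; the 2-state, 2-action deterministic MDP (the tables P and R, gamma, and eps) is a made-up example, not from the course:

#include <cstdio>
#include <cmath>
#include <algorithm>

// Value iteration v_{k+1}(s) = max_a [ R(s,a) + gamma * v_k(P(s,a)) ]
// on a made-up deterministic MDP: P[s][a] is the next state, R[s][a] the reward.
int main() {
    const int S = 2, A = 2;
    int    P[S][A] = {{0, 1}, {1, 0}};
    double R[S][A] = {{0.0, 1.0}, {2.0, 0.0}};
    const double gamma = 0.9, eps = 1e-6;

    double v[S] = {0.0, 0.0};
    for (;;) {
        double v_new[S], diff = 0.0;
        for (int s = 0; s < S; ++s) {
            double best = R[s][0] + gamma * v[P[s][0]];
            for (int a = 1; a < A; ++a)                    // greedy over q(s,a)
                best = std::max(best, R[s][a] + gamma * v[P[s][a]]);
            v_new[s] = best;
            diff = std::max(diff, std::fabs(v_new[s] - v[s]));
        }
        std::copy(v_new, v_new + S, v);
        if (diff < eps) break;                             // |v_k - v_{k+1}| < eps
    }
    printf("v* = (%.4f, %.4f)\n", v[0], v[1]);
}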

image-20230130110645402

image-20230130111422776

image-20230130111447362

image-20230130111532505

image-20230130113517281

image-20230130113734211

image-20230130113912467

image-20230130114018189

image-20230130114157800

image-20230130114247324

image-20230130114456254

Invariance of the optimal policy

Only the relative magnitudes of the rewards matter.

image-20230130115304172

To discourage detours, a reward of $r=-1$ is usually given for each step; in fact, though, whether detours are penalized depends not only on $r$ but also on $\gamma$.
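
More precisely, the optimal-policy invariance result states: if every reward is transformed affinely as $r \to \alpha r + \beta$ with $\alpha > 0$, the optimal policy stays the same, while the optimal values become

$$v^* \to \alpha v^* + \frac{\beta}{1-\gamma}\mathbf{1},$$

so only the relative sizes of the rewards, together with $\gamma$, determine the optimal policy.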

image-20230130115738768

image-20230130115805408

image-20230130115829264

Value Iteration and Policy Iteration

image-20230131151048584

image-20230131152746219

image-20230131152917800

image-20230131153048452

image-20230131153207479

image-20230131153406730

image-20230131153527870

image-20230131153935160

image-20230131154240041

image-20230131154313455

image-20230131155308170

image-20230131162519860

image-20230131162654805

image-20230131162929616

image-20230131163123313

image-20230131163452411

image-20230131163529716

image-20230131163652331

image-20230131164125150

image-20230131164233162

image-20230131164544775

One phenomenon worth noting: the policies at states close to the target improve first.

image-20230131165042213

image-20230131165100770

image-20230131172317995

image-20230131172454220

image-20230131172844456

image-20230131173305491

image-20230131173407019

image-20230131173415561

image-20230131173630141

image-20230131173744838

image-20230131173810065

image-20230131173912702

image-20230131174112970

Monte Carlo Methods

Monte Carlo estimation

image-20230202124108095

image-20230202124126444

image-20230202124138763

The law of large numbers is the fundamental guarantee.
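
Concretely, for i.i.d. samples $x_1,\dots,x_n$ of a random variable $X$, the Monte Carlo estimate

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$

is unbiased, $\mathbb{E}[\bar{x}]=\mathbb{E}[X]$, and its variance $\mathrm{Var}[\bar{x}]=\mathrm{Var}[X]/n$ vanishes as $n\to\infty$, so the estimate concentrates around the true expectation.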

image-20230202123925277

image-20230202124042463

MC Basic

image-20230202124320625

image-20230202124427595

image-20230202124644112

image-20230202124801646

image-20230202125145680

image-20230202125450943

image-20230202125533926

image-20230202125837637

image-20230202130014373

image-20230202130137583

image-20230202130355046

image-20230202130505449

image-20230202130927495

image-20230202130633106

image-20230202130704935

image-20230202131521690

MC Exploring Starts

image-20230202152711518

image-20230202153302344

How should a state that appears several times within one episode be handled? There are two approaches: one counts only the return following the state's first occurrence in the episode, called the "first-visit MC method"; the other computes the return at every occurrence of the state and averages them, called the "every-visit MC method".
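
A minimal C++ sketch contrasting the two schemes on a single episode (the episode data, states, and rewards are made up for illustration; a real MC method would additionally average returns across many episodes):

#include <cstdio>
#include <map>
#include <set>
#include <utility>
#include <vector>

// One episode as (state, reward received after leaving that state) pairs.
int main() {
    std::vector<std::pair<int, double>> steps = {{0, 1.0}, {1, 0.0}, {0, 2.0}, {2, 5.0}};
    const double gamma = 0.9;
    const int T = (int)steps.size();

    // Backward pass: G_t = r_t + gamma * G_{t+1}.
    std::vector<double> G(T);
    double g = 0.0;
    for (int t = T - 1; t >= 0; --t) { g = steps[t].second + gamma * g; G[t] = g; }

    std::map<int, double> sum_fv, sum_ev;
    std::map<int, int> cnt_fv, cnt_ev;
    std::set<int> seen;
    for (int t = 0; t < T; ++t) {
        int s = steps[t].first;
        sum_ev[s] += G[t]; cnt_ev[s]++;        // every-visit: count all occurrences
        if (seen.insert(s).second) {           // first-visit: only the first one
            sum_fv[s] += G[t]; cnt_fv[s]++;
        }
    }
    for (auto& [s, c] : cnt_ev)
        printf("state %d: first-visit %.4f, every-visit %.4f\n",
               s, sum_fv[s] / cnt_fv[s], sum_ev[s] / c);
}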

image-20230202154008533

image-20230202154310827

image-20230202154508194

image-20230202154850900

image-20230202155231347

image-20230202155324948

MC Epsilon-Greedy

image-20230202160022913

image-20230202160137321

image-20230202160514414

image-20230202160605322

image-20230202160714414

image-20230202160728966

image-20230202163009051

image-20230202163227722

image-20230202163750693

image-20230202163931356

image-20230202165028642

At the start $\epsilon$ is relatively large, so exploration is strong; $\epsilon$ is then decreased gradually until it reaches $\epsilon=0$ at the end, at which point a fairly good (near-greedy) policy is obtained.
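
A minimal C++ sketch of $\epsilon$-greedy selection with a decaying $\epsilon$ (the q-values, decay schedule, and random seed are made-up assumptions). With probability $1-\epsilon$ the greedy action is chosen, otherwise an action is drawn uniformly, so the greedy action keeps total probability $1-\epsilon+\epsilon/|\mathcal{A}|$:

#include <cstdio>
#include <algorithm>
#include <random>

// Epsilon-greedy selection over q(s, .) for one fixed state.
int select_action(const double* q, int num_actions, double eps, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) >= eps)                                        // exploit
        return (int)(std::max_element(q, q + num_actions) - q);
    std::uniform_int_distribution<int> any(0, num_actions - 1);
    return any(rng);                                             // explore
}

int main() {
    std::mt19937 rng(42);
    double q[3] = {0.1, 0.5, 0.2};
    int counts[3] = {0, 0, 0};
    double eps = 1.0;                        // strong exploration at first
    for (int k = 0; k < 10000; ++k) {
        counts[select_action(q, 3, eps, rng)]++;
        eps = std::max(0.0, eps - 1e-4);     // decay toward pure exploitation
    }
    printf("action counts: %d %d %d\n", counts[0], counts[1], counts[2]);
}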

image-20230202170813951

image-20230202171135535

Stochastic Approximation and Stochastic Gradient Descent

SA

image-20230203105425941

image-20230203105830294

image-20230203105850116

image-20230203111814621

image-20230203111925791

image-20230203112050895

image-20230203112342972

image-20230203112520371

image-20230203112654610

image-20230203112727450

image-20230203113108829

image-20230203113452172

image-20230203113929155

image-20230203113954536

image-20230203114035967

image-20230203114158586

image-20230203114410698

Exercise: solve $(x-1)^2-1=0$ with a simple iterative scheme.

#include <cstdio>

// Find a root of g(w) = (w-1)^2 - 1 with the fixed-step iteration
// w_{k+1} = w_k - (1/a) * g(w_k).  Starting from w_0 = 8, the sequence
// converges to the root w = 2 (the other root, w = 0, is not reached).
int main() {
    auto g = [](double x) { return (x - 1) * (x - 1) - 1; };
    double w = 8, a = 100;  // a is the step-size denominator
    for (int i = 1; i <= 1000; ++i) {
        w = w - (1. / a) * g(w);
        printf("%.4lf\n", w);
        // a += 1;  // uncomment for a diminishing step size a_k = 1/(100+k)
    }
}

image-20230203152453651

image-20230203153633575

image-20230203153834010

image-20230203155829007

image-20230203155935511

image-20230203160258341

image-20230203160433484

image-20230203160610568

image-20230203160943881

image-20230203161037063

SGD

image-20230203161235154

image-20230203161450132

image-20230203161533590

image-20230203161815578

image-20230203162342322

image-20230203162438020

image-20230203162629102

image-20230203162744714

image-20230203162754157

image-20230203163057302

image-20230203163324552

When $w_k$ is far from $w^*$, SGD behaves much like GD; when $w_k$ is close to $w^*$, SGD exhibits larger random fluctuations.
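
A minimal C++ sketch of the classic mean-estimation example (the Gaussian sampling distribution, seed, and step schedule are made-up assumptions): minimizing $J(w)=\mathbb{E}[(w-X)^2/2]$ by replacing the true gradient $w-\mathbb{E}[X]$ with the stochastic one $w-x_k$ drives $w$ toward $w^*=\mathbb{E}[X]$; the printed trajectory moves quickly while far from $w^*$ and only wanders slightly once near it:

#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(0);
    std::normal_distribution<double> X(5.0, 2.0);  // E[X] = 5, so w* = 5
    double w = 100.0;                              // start far from w*
    for (int k = 1; k <= 2000; ++k) {
        double alpha = 1.0 / k;                    // diminishing step size
        w -= alpha * (w - X(rng));                 // stochastic gradient step
        if (k % 400 == 0) printf("k = %4d, w = %.4f\n", k, w);
    }
}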

image-20230203163439189

image-20230203164005471

image-20230203164123744

image-20230203164351581

image-20230203164517251

image-20230203164755113

image-20230203164924835

image-20230203165049199

image-20230203165442286

Temporal-Difference Methods

TD to estimate state value

image-20230205112031980

image-20230205112120723

image-20230205112604504

image-20230205112808211

image-20230205112912421

image-20230205113016778

image-20230205124156070

image-20230205124534274

image-20230205124813497

image-20230205125222962

image-20230205125902823

image-20230205130715738

image-20230205131516906

image-20230205132054107

image-20230205132753353

image-20230205132942743

image-20230205133200664

image-20230205133643459

image-20230205134104715

image-20230205134540088

Sarsa

image-20230207102242896

image-20230207102622781

image-20230207102746250

image-20230207102840285

image-20230207103320465

image-20230207103623975

image-20230207103814934

image-20230207104023335

image-20230207104340791

image-20230207104724791

image-20230207105245134

image-20230207105310980

image-20230207105706927

image-20230207110245482

Q-learning

image-20230207111538698

image-20230207111809368

image-20230207111852978

image-20230207112524832

image-20230207121317235

image-20230207121447117

image-20230207121924701

image-20230207122057561

image-20230207122530203

In Q-learning, the behavior policy and the target policy may be the same or different.
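
For reference, the tabular Q-learning update is

$$q_{t+1}(s_t,a_t)=q_t(s_t,a_t)-\alpha_t(s_t,a_t)\Big[q_t(s_t,a_t)-\big(r_{t+1}+\gamma\max_{a}q_t(s_{t+1},a)\big)\Big].$$

Because the TD target maximizes over actions instead of using the action the behavior policy actually takes next, the policy that generates the data need not coincide with the greedy target policy, which is what makes off-policy learning possible.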

image-20230207163419676

image-20230207163910118

Exploration should be strong at the beginning and then decrease gradually.

image-20230207164037569

image-20230207164241897

image-20230207173111740

image-20230207173352322

Value Function Approximation

image-20230211142428438

image-20230211142803158

image-20230211143222558

image-20230211143404180

image-20230211143537470

image-20230211143659096

image-20230211143903384

image-20230211144242457

image-20230211144628223

image-20230211144854823

image-20230211145116235

image-20230211145420015

image-20230211145612752

At the beginning the curve fluctuates considerably; as the number of visits grows it levels off. The asterisks at the end of the curves mark the theoretical values, meaning that $d_{\pi}$ can be obtained directly by computation rather than by running many episodes.

image-20230211145858263

$P_{\pi}$ is the state transition matrix under $\pi$, and $d_{\pi}$ is the eigenvector of $P_{\pi}^{\mathsf{T}}$ associated with eigenvalue 1 (a left eigenvector of $P_{\pi}$).
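
In other words, $d_{\pi}$ is the stationary distribution of the Markov chain induced by $\pi$:

$$d_{\pi}^{\mathsf{T}}P_{\pi}=d_{\pi}^{\mathsf{T}},\qquad d_{\pi}(s)\ge 0,\qquad \sum_{s}d_{\pi}(s)=1.$$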

image-20230211150639712

image-20230211151642757

image-20230211151845491

image-20230211152014865

image-20230211152151375

image-20230211153149852

image-20230211153304245

image-20230211153613837

image-20230211153903147

image-20230211153952413

image-20230211160404208

image-20230211160539443

image-20230211160749640

The order of the variables in the feature vector can be rearranged.

image-20230211161211933

image-20230211161331149

image-20230211161911077

image-20230211161947512

image-20230211162701113

image-20230211162755506

image-20230211162942240

image-20230211163216602

image-20230211163934609

image-20230211164720056

image-20230211164813223

Compared with the earlier tabular Sarsa, the value-update step here updates the parameter $w$ rather than entries of a q-table.
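
For reference, the Sarsa update with value function approximation moves $w$ along the gradient of the approximate action value:

$$w_{t+1}=w_t+\alpha_t\Big[r_{t+1}+\gamma\,\hat{q}(s_{t+1},a_{t+1},w_t)-\hat{q}(s_t,a_t,w_t)\Big]\nabla_w\hat{q}(s_t,a_t,w_t).$$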

image-20230211165020705

image-20230211165157524

image-20230211165206872

image-20230211165531000

image-20230211165839428

image-20230211170043228

image-20230211170433978

image-20230211170536711

image-20230211171045701

image-20230211171429856

image-20230211171531772

image-20230211172109613

image-20230211180747405

image-20230211180832698

image-20230211180931444

image-20230211181711407

This implementation differs somewhat from the original paper, which uses more tricks and obtains better results.

image-20230211192810072

image-20230211192838251

image-20230211193116064

image-20230211193431340

image-20230211193634339

Policy Gradient Methods

image-20230212145911032

image-20230212145917670

image-20230212150200013

image-20230212150302905

image-20230212150456339

image-20230212150657242

image-20230212150740196

image-20230212150813737

image-20230212151325343

image-20230212151622690

image-20230212151647192

image-20230212152001092

image-20230212152135372

image-20230212152441221

image-20230212152456295

image-20230212152615727

image-20230212152713066

image-20230212152852927

image-20230212153140140

image-20230212153534039

image-20230212153708316

image-20230212153942480

image-20230212155544454

image-20230212155649348

image-20230212155739575

image-20230212155955234

image-20230212160119928

image-20230212160238576

image-20230212160517526

image-20230212160616825

image-20230212160714132

image-20230212161634186

image-20230212161714104

image-20230212161730133

image-20230212161910747

image-20230212162040803

image-20230212162619328

image-20230212162803778

image-20230212163201225

image-20230212163339897

image-20230212163351746

image-20230212164311111

Actor-Critic Methods

image-20230213202744852

image-20230213203233118

image-20230213204941825

image-20230213205406480

image-20230213205839432

image-20230213210651419

image-20230213210926589

image-20230213211049483

image-20230213212533373

image-20230213212921312

image-20230213213117778

image-20230213213224197

image-20230213213520128

image-20230213214905245

image-20230213215220340

image-20230213215333269

image-20230214095731174

image-20230214095809899

image-20230214100109188

image-20230214100301077

image-20230214100504145

image-20230214100523502

image-20230214100723742

image-20230214100927134

image-20230214101023733

image-20230214101611484

image-20230214101811540

image-20230214102210305

image-20230214103120508

image-20230214103229775

image-20230214103353216

image-20230214103526978

image-20230214103822767

image-20230214104040101

image-20230214113829558

image-20230214113429288

image-20230214113811515

image-20230214114049380

image-20230214114504990

image-20230214114605551

image-20230214114851783

image-20230214115051173

image-20230214115151180

image-20230214115620567

