Building a Music-Composing AI with RNN Attention: Training [2022]

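TIP
I am studying generative machine learning with the book Generative Deep Learning. Chapter 7 contains a fun project that composes music with AI, so I am writing it up here with some explanation.
The book describes the model architecture of the RNN with attention, but its explanation of the code and its guidance on actually running it are somewhat thin.
So this post also serves as a memo for the next time I rerun the same program.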

Prerequisites

The Python library music21 is installed.
MuseScore 3 (musescore3) is installed.
music21's user settings file (.music21rc) is configured.
GPU support such as CUDA is set up.
Jupyter Notebook is available.

Notes on the environment setup

music21 itself can be installed with pip install music21. I installed it on an on-premises Linux machine, so I entered the path of the apt-installed musescore3 in the settings file as follows.

<preference name="musescoreDirectPNGPath" value="/usr/bin/mscore3" />
<preference name="musicxmlPath" value="/usr/bin/mscore3" />
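A quick way to check settings like the ones above is to parse them and confirm that every configured path exists on the machine. The following is a minimal sketch using only the standard library. The `<settings>` root element and the helper names `read_preferences` and `missing_paths` are assumptions made for illustration, not part of music21.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a music21 settings file. The <settings> root
# element is an assumption so that the snippet parses on its own.
SETTINGS_XML = """
<settings>
  <preference name="musescoreDirectPNGPath" value="/usr/bin/mscore3" />
  <preference name="musicxmlPath" value="/usr/bin/mscore3" />
</settings>
"""

def read_preferences(xml_text):
    """Return {preference name: value} for every <preference> element."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): p.get("value") for p in root.iter("preference")}

def missing_paths(preferences):
    """Return the names of preferences whose path does not exist here."""
    return sorted(name for name, path in preferences.items()
                  if not os.path.exists(path))

prefs = read_preferences(SETTINGS_XML)
print(prefs)
print("missing:", missing_paths(prefs))
```

A malformed tag makes `ET.fromstring` raise a `ParseError`, so the same snippet also catches typos in the file before music21 ever reads it.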

Importing the libraries

import os
import pickle
import numpy
from music21 import note, chord

from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.utils import plot_model

from models.RNNAttention import get_distinct, create_lookups, prepare_sequences, get_music_list, create_network

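Setting the parameters

Four directories are used to store data at run time.

store: holds the following four pickled objects
- distincts: the distinct note names and durations, together with their counts
- durations: the length of each note
- lookups: dictionaries that map note names and durations to integer indices and back
- notes: the pitches
output: holds the generated MIDI files
weights: holds the model weights
viz: destination for the model diagram drawn by plot_model (the call is commented out in this notebook because it raises an error on TF 2.2)

The training data is a set of MIDI files of Bach's cello suites.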

# run params
section = 'compose'
run_id = '0006'
music_name = 'cello'

run_folder = 'run/{}/'.format(section)
run_folder += '_'.join([run_id, music_name])

store_folder = os.path.join(run_folder, 'store')
data_folder = os.path.join('data', music_name)

# create the run directory and its four subdirectories;
# makedirs also creates missing parents such as run/compose
for sub_folder in ['store', 'output', 'weights', 'viz']:
    os.makedirs(os.path.join(run_folder, sub_folder), exist_ok=True)

mode = 'build'  # set to 'load' to reuse the pickled notes and durations

# data params
intervals = range(1)  # transpositions in semitones; range(1) keeps the original key only
seq_len = 32  # split the data into small chunks of 32 notes

# model params
embed_size = 100
rnn_units = 256
use_attention = True  # whether to use attention

Extracting the notes

if mode == 'build':
    # get the list of MIDI files in data_folder and the parser
    music_list, parser = get_music_list(data_folder)
    print(len(music_list), 'files in total')

    # lists that collect the pitches and the note lengths
    notes = []
    durations = []

    # start parsing
    for i, file in enumerate(music_list):
        print(i+1, "Parsing %s" % file)
        original_score = parser.parse(file).chordify()

        for interval in intervals:
            score = original_score.transpose(interval)

            # pad the start of every piece with seq_len START tokens
            notes.extend(['START'] * seq_len)
            durations.extend([0] * seq_len)

            for element in score.flat:

                if isinstance(element, note.Note):
                    # isRest is always False for note.Note (rests are note.Rest),
                    # so the first branch below is never taken
                    if element.isRest:
                        notes.append(str(element.name))
                        durations.append(element.duration.quarterLength)
                    else:
                        notes.append(str(element.nameWithOctave))
                        durations.append(element.duration.quarterLength)

                if isinstance(element, chord.Chord):
                    # a chord becomes its pitch names joined with '.', e.g. 'A2.A3'
                    notes.append('.'.join(n.nameWithOctave for n in element.pitches))
                    durations.append(element.duration.quarterLength)

    with open(os.path.join(store_folder, 'notes'), 'wb') as f:
        pickle.dump(notes, f) #['G2', 'D3', 'B3', 'A3', 'B3', 'D3', 'B3', 'D3', 'G2',...]
    with open(os.path.join(store_folder, 'durations'), 'wb') as f:
        pickle.dump(durations, f)
else:
    with open(os.path.join(store_folder, 'notes'), 'rb') as f:
        notes = pickle.load(f) #['G2', 'D3', 'B3', 'A3', 'B3', 'D3', 'B3', 'D3', 'G2',...]
    with open(os.path.join(store_folder, 'durations'), 'rb') as f:
        durations = pickle.load(f)

36 files in total
1 Parsing data/cello/cs5-4sar.mid
2 Parsing data/cello/cs1-5men.mid
3 Parsing data/cello/cs3-1pre.mid
4 Parsing data/cello/cs6-5gav.mid
5 Parsing data/cello/cs5-3cou.mid
6 Parsing data/cello/cs2-5men.mid
7 Parsing data/cello/cs1-1pre.mid
8 Parsing data/cello/cs3-3cou.mid
9 Parsing data/cello/cs5-5gav.mid
10 Parsing data/cello/cs3-6gig.mid
11 Parsing data/cello/cs2-6gig.mid
12 Parsing data/cello/cs3-4sar.mid
13 Parsing data/cello/cs6-4sar.mid
14 Parsing data/cello/cs6-6gig.mid
15 Parsing data/cello/cs4-3cou.mid
16 Parsing data/cello/cs4-5bou.mid
17 Parsing data/cello/cs5-2all.mid
18 Parsing data/cello/cs4-2all.mid
19 Parsing data/cello/cs2-3cou.mid
20 Parsing data/cello/cs1-6gig.mid
21 Parsing data/cello/cs1-4sar.mid
22 Parsing data/cello/cs3-2all.mid
23 Parsing data/cello/cs6-3cou.mid
24 Parsing data/cello/cs4-4sar.mid
25 Parsing data/cello/cs3-5bou.mid
26 Parsing data/cello/cs2-2all.mid
27 Parsing data/cello/cs6-2all.mid
28 Parsing data/cello/cs6-1pre.mid
29 Parsing data/cello/cs4-1pre.mid
30 Parsing data/cello/cs2-1pre.mid
31 Parsing data/cello/cs5-6gig.mid
32 Parsing data/cello/cs5-1pre.mid
33 Parsing data/cello/cs1-2all.mid
34 Parsing data/cello/cs4-6gig.mid
35 Parsing data/cello/cs2-4sar.mid
36 Parsing data/cello/cs1-3cou.mid
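To make the token format concrete, here is a small sketch of the encoding step without music21. The `events` list is a hypothetical stand-in for a chordified score, and `encode_piece` is an illustrative name, not a function from the book's code: each event is a list of pitch names plus a duration in quarter notes.

```python
from fractions import Fraction

SEQ_LEN = 32  # same value as seq_len in the parameters

def encode_piece(events, seq_len=SEQ_LEN):
    """Turn (pitch names, quarter length) events into note and duration tokens.

    `events` is a hypothetical stand-in for the chordified music21 score.
    """
    # every piece starts with seq_len padding tokens
    notes = ['START'] * seq_len
    durations = [0] * seq_len
    for pitches, quarter_length in events:
        # several simultaneous pitches are joined with '.', as for chords
        notes.append('.'.join(pitches))
        durations.append(quarter_length)
    return notes, durations

events = [(['G2'], 0.25), (['D3'], 0.25), (['A2', 'A3'], Fraction(1, 3))]
notes, durations = encode_piece(events)
print(notes[-3:])      # ['G2', 'D3', 'A2.A3']
print(durations[-3:])  # [0.25, 0.25, Fraction(1, 3)]
print(len(notes))      # 35
```

The START padding means that the first training window of each piece consists only of padding, so the model also learns how a piece begins.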

Creating the lookup tables

# get the distinct sets of notes and durations
note_names, n_notes = get_distinct(notes)
duration_names, n_durations = get_distinct(durations)
distincts = [note_names, n_notes, duration_names, n_durations]

with open(os.path.join(store_folder, 'distincts'), 'wb') as f:
    pickle.dump(distincts, f)

# build the lookup dictionaries and save them
note_to_int, int_to_note = create_lookups(note_names)
duration_to_int, int_to_duration = create_lookups(duration_names)
lookups = [note_to_int, int_to_note, duration_to_int, int_to_duration]

with open(os.path.join(store_folder, 'lookups'), 'wb') as f:
    pickle.dump(lookups, f)

# inspect note_to_int, the first item stored in lookups:
# it shows which integer index each note name was mapped to
print('\nnote_to_int')
note_to_int

note_to_int
{'A2': 0,
 'A2.A3': 1,
 ...(omitted)
 'G5': 459,
 'START': 460}

# show which integer index each duration was mapped to
print('\nduration_to_int')
duration_to_int

duration_to_int
{0: 0,
 Fraction(1, 12): 1,
 Fraction(1, 6): 2,
 0.25: 3,
 Fraction(1, 3): 4,
 Fraction(5, 12): 5,
 0.5: 6,
 Fraction(2, 3): 7,
 0.75: 8,
 1.0: 9,
 1.25: 10,
 Fraction(4, 3): 11,
 1.5: 12,
 1.75: 13,
 2.0: 14,
 2.25: 15,
 2.5: 16,
 3.0: 17,
 4.0: 18}
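The helpers get_distinct and create_lookups come from models.RNNAttention and are not shown in this post. Judging from the printed dictionaries (sorted keys, START last, consecutive indices), they plausibly behave like the sketch below. The `_sketch` functions are assumptions written for illustration, not the book's implementation.

```python
from fractions import Fraction

def get_distinct_sketch(elements):
    """Return the sorted unique elements and how many there are."""
    names = sorted(set(elements))
    return names, len(names)

def create_lookups_sketch(names):
    """Return (element -> index, index -> element) dictionaries."""
    to_int = {name: i for i, name in enumerate(names)}
    to_name = {i: name for i, name in enumerate(names)}
    return to_int, to_name

# toy data in the same token format as the real notes and durations
toy_notes = ['START', 'G2', 'D3', 'G2', 'A2.A3']
toy_durations = [0, 0.25, Fraction(1, 3), 0.25, 0.5]

toy_note_names, toy_n_notes = get_distinct_sketch(toy_notes)
toy_note_to_int, toy_int_to_note = create_lookups_sketch(toy_note_names)
print(toy_note_to_int)  # {'A2.A3': 0, 'D3': 1, 'G2': 2, 'START': 3}

toy_duration_names, toy_n_durations = get_distinct_sketch(toy_durations)
print(toy_duration_names)  # [0, 0.25, Fraction(1, 3), 0.5]
```

Sorting works across int, float and Fraction because Python compares these numeric types by value, which is why the real duration table mixes entries such as 0.25 and Fraction(1, 3) in ascending order.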

Preparing the sequences used by the neural network

network_input, network_output = prepare_sequences(notes, durations, lookups, distincts, seq_len)

# print the pitch input of the first training example
print('pitch input')
print(network_input[0][0])

# print the duration input of the first training example
print('duration input')
print(network_input[1][0])

print('pitch output')
print(network_output[0][0])

print('duration output')
print(network_output[1][0])

1
pitch input
2
[460 460 460 460 460 460 460 460 460 460 460 460 460 460 460 460 460 460
3
 460 460 460 460 460 460 460 460 460 460 460 460 460 460]
4
duration input
5
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
6
pitch output
7
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
8
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
9
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
10
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
11
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
12
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
13
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
14
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
15
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
16
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
17
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
18
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
19
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
20
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
21
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
22
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
23
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
24
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
25
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
26
 0. 0. 0. 0. 0.]
27
duration output
28
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
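The printed example matches a sliding-window setup: the input is a window of seq_len integer tokens (here all 460, the START index) and the output is a one-hot vector for the token that follows the window. prepare_sequences itself is not shown in this post, so the function below is only a sketch under that assumption, written in plain Python and for a single token stream. The real function handles notes and durations together.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector

def prepare_sequences_sketch(tokens, token_to_int, seq_len):
    """Build (window of seq_len ids, one-hot of the next id) pairs."""
    ids = [token_to_int[t] for t in tokens]
    inputs, outputs = [], []
    for i in range(len(ids) - seq_len):
        inputs.append(ids[i:i + seq_len])
        outputs.append(one_hot(ids[i + seq_len], len(token_to_int)))
    return inputs, outputs

toy_seq_len = 4  # the post uses 32; 4 keeps the printout short
toy_tokens = ['START'] * toy_seq_len + ['G2', 'D3', 'B3']
toy_lookup = {'B3': 0, 'D3': 1, 'G2': 2, 'START': 3}

toy_inputs, toy_outputs = prepare_sequences_sketch(toy_tokens, toy_lookup, toy_seq_len)
print(toy_inputs[0])    # [3, 3, 3, 3]
print(toy_outputs[0])   # [0.0, 0.0, 1.0, 0.0]
print(toy_inputs[1])    # [3, 3, 3, 2]
print(len(toy_inputs))  # 3
```

Read this way, the duration output above says that the first real note has duration index 6, which is 0.5 in duration_to_int.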

Building the neural network

model, att_model = create_network(n_notes, n_durations, embed_size, rnn_units, use_attention)
# print the model summary
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, None)]       0
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 100)    46100       input_1[0][0]
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 100)    1900        input_2[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, None, 200)    0           embedding[0][0]
                                                                 embedding_1[0][0]
__________________________________________________________________________________________________
lstm (LSTM)                     (None, None, 256)    467968      concatenate[0][0]
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, None, 256)    525312      lstm[0][0]
__________________________________________________________________________________________________
dense (Dense)                   (None, None, 1)      257         lstm_1[0][0]
__________________________________________________________________________________________________
reshape (Reshape)               (None, None)         0           dense[0][0]
__________________________________________________________________________________________________
activation (Activation)         (None, None)         0           reshape[0][0]
__________________________________________________________________________________________________
repeat_vector (RepeatVector)    (None, 256, None)    0           activation[0][0]
__________________________________________________________________________________________________
permute (Permute)               (None, None, 256)    0           repeat_vector[0][0]
__________________________________________________________________________________________________
multiply (Multiply)             (None, None, 256)    0           lstm_1[0][0]
                                                                 permute[0][0]
__________________________________________________________________________________________________
lambda (Lambda)                 (None, 256)          0           multiply[0][0]
__________________________________________________________________________________________________
pitch (Dense)                   (None, 461)          118477      lambda[0][0]
__________________________________________________________________________________________________
duration (Dense)                (None, 19)           4883        lambda[0][0]
==================================================================================================
Total params: 1,164,897
Trainable params: 1,164,897
Non-trainable params: 0
__________________________________________________________________________________________________
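The parameter counts in the summary can be reproduced by hand from the vocabulary sizes (461 notes, 19 durations) and the model parameters (embed_size = 100, rnn_units = 256). The sketch below uses the usual formulas for embedding, LSTM and dense layers. The helper function names are mine, chosen for illustration.

```python
def embedding_params(vocab_size, embed_size):
    # one embed_size vector per vocabulary entry
    return vocab_size * embed_size

def lstm_params(input_size, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_size + units) * units + units)

def dense_params(input_size, units):
    # weight matrix plus one bias per unit
    return input_size * units + units

n_notes, n_durations = 461, 19
embed_size, rnn_units = 100, 256

counts = {
    'embedding': embedding_params(n_notes, embed_size),
    'embedding_1': embedding_params(n_durations, embed_size),
    'lstm': lstm_params(2 * embed_size, rnn_units),  # input is the 200-wide concatenation
    'lstm_1': lstm_params(rnn_units, rnn_units),
    'dense': dense_params(rnn_units, 1),
    'pitch': dense_params(rnn_units, n_notes),
    'duration': dense_params(rnn_units, n_durations),
}

for layer_name, count in counts.items():
    print(layer_name, count)
print('total', sum(counts.values()))  # total 1164897
```

Every value agrees with the Param # column, and the layers between dense and lambda add nothing because they only reshape, normalize and combine existing tensors.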

# Currently errors in TF2.2
#plot_model(model, to_file=os.path.join(run_folder ,'viz/model.png'), show_shapes = True, show_layer_names = True)
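The chain dense, reshape, activation, repeat_vector, permute, multiply, lambda in the summary can be read as attention pooling over time: one score per time step, a normalization of those scores into weights, and a weighted sum of the LSTM hidden states that yields a single 256-wide vector. The sketch below shows that computation in plain Python. It assumes that the activation is a softmax, that the lambda sums over the time axis, and that the score is a plain linear projection. None of these details is visible in the summary, so treat it as an illustration rather than the model's exact code.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, weights, bias=0.0):
    """Pool T hidden vectors into one context vector.

    hidden_states: list of T vectors, one per time step
    weights, bias: parameters of the one-unit scoring layer (assumed linear)
    """
    scores = [sum(w * h for w, h in zip(weights, state)) + bias
              for state in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(a * state[d] for a, state in zip(alphas, hidden_states))
               for d in range(dim)]
    return alphas, context

# toy example: 3 time steps, hidden size 2
toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_alphas, toy_context = attention_pool(toy_states, weights=[0.0, 2.0])
print(toy_alphas)   # the first step gets the smallest weight
print(toy_context)
```

The weights in `toy_alphas` are what an attention plot would show: how much each earlier time step contributes to the prediction of the next note.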

Training the neural network

weights_folder = os.path.join(run_folder, 'weights')
# to continue training from saved weights, uncomment the line below
# model.load_weights(os.path.join(weights_folder, "weights.h5"))

# checkpoint during training: a new file each time the loss improves
checkpoint1 = ModelCheckpoint(
    os.path.join(weights_folder, "weights-improvement-{epoch:02d}-{loss:.4f}-bigger.h5"),
    monitor='loss',
    verbose=0,
    save_best_only=True,
    mode='min'
)

# final checkpoint: weights.h5 is overwritten whenever the loss improves
checkpoint2 = ModelCheckpoint(
    os.path.join(weights_folder, "weights.h5"),
    monitor='loss',
    verbose=0,
    save_best_only=True,
    mode='min'
)

# early stopping: stop once the loss has not improved for 10 epochs
early_stopping = EarlyStopping(
    monitor='loss',
    restore_best_weights=True,
    patience=10
)

callbacks_list = [
    checkpoint1,
    checkpoint2,
    early_stopping
]

model.save_weights(os.path.join(weights_folder, "weights.h5"))

# training
model.fit(network_input, network_output,
          epochs=2000000, batch_size=32,
          validation_split=0.2,
          callbacks=callbacks_list,
          shuffle=True)
# the epoch count is very large, but that is fine because early stopping ends the training

Epoch 1/2000000
720/720 [==============================] - 34s 47ms/step - loss: 4.3821 - pitch_loss: 3.5396 - duration_loss: 0.8425 - val_loss: 3.7145 - val_pitch_loss: 3.1518 - val_duration_loss: 0.5626
Epoch 2/2000000
720/720 [==============================] - 35s 48ms/step - loss: 3.8736 - pitch_loss: 3.2409 - duration_loss: 0.6327 - val_loss: 3.6074 - val_pitch_loss: 3.0458 - val_duration_loss: 0.5616
...(omitted)
Epoch 152/2000000
720/720 [==============================] - 35s 48ms/step - loss: 0.2035 - pitch_loss: 0.1727 - duration_loss: 0.0308 - val_loss: 7.7287 - val_pitch_loss: 6.2841 - val_duration_loss: 1.4446
Epoch 153/2000000
720/720 [==============================] - 35s 48ms/step - loss: 0.2063 - pitch_loss: 0.1768 - duration_loss: 0.0295 - val_loss: 7.6420 - val_pitch_loss: 6.2567 - val_duration_loss: 1.3853

<tensorflow.python.keras.callbacks.History at 0x7efdb6a86fd0>
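The patience rule behind EarlyStopping can be sketched in a few lines. This is a simplified model of the callback (no min_delta, no weight restoring), and the function name and loss values are made up for illustration: training stops once the monitored value has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(monitored, patience):
    """Return (stop epoch, best value); stop epoch is None if never triggered.

    monitored: the monitored metric per epoch, first epoch first
    """
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, value in enumerate(monitored, start=1):
        if value < best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return None, best

# illustrative values only, not taken from the training log
toy_losses = [1.0, 0.9, 0.8, 0.85, 0.82, 0.81, 0.9]
print(early_stopping_epoch(toy_losses, patience=3))  # (6, 0.8)
print(early_stopping_epoch(toy_losses, patience=5))  # (None, 0.8)
```

Because the callbacks in this post use monitor='loss', only the training loss enters this decision. The val_loss values in the log, which rise from 3.7 to 7.6 while the training loss falls to 0.2, play no part in when training stops or which weights are saved.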

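Summary

This post recorded the flow up to training, with some explanation. The code alone is already long, so I will leave inference for next time. Learning that music, like text, can be turned into data and learned from, and then actually running it, deepened my understanding considerably.

Generative Deep Learning also covers MuseGAN in the latter part of this chapter, so I highly recommend the book. If you are interested in AI composition, please consider buying it. (I have no stake in it.)

I would also like to try things such as generating music from mathematical formulas using musicscore.

For now, here is a track I generated ahead of the next post.

Link to the AI-generated melody (SoundCloud)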