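Daddy Makers: Preparing HDF5 files of large image sets for deep learning training

This post briefly explains how to prepare image data for deep learning training.

Steps for preparing big data for training
First, decide clearly what the training data is for. Depending on the kind of deep learning model being trained, the data structure and format may need to be converted. Also decide how large the training and validation sets need to be. The rough order for preparing training data is as follows.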

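1. Decide the purpose of the training data
2. Decide the data format, structure, and size
3. Decide how the data will be collected
4. Decide how the data will be managed
5. Collect the data
6. Clean up the data
7. Label and annotate the data
8. Check the quality of the processed data
9. Prepare the training and validation sets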

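Now let's take a brief look at how to prepare HDF5 data for deep learning training.

HDF5 data for deep learning training
Among big data, multimedia data is convenient to store and manage in HDF5 files. Many vision-related deep learning examples use HDF5 (Hierarchical Data Format version 5; despite the similar-sounding name, it is unrelated to Hadoop). For the HDF5 file structure, see the following link.

HDF5 file structure and usage

Deep learning needs a great deal of data. For example, the ImageNet ILSVRC training set contains more than a million images. In that situation it is not sensible to load every image into memory, apply the image preprocessing, and then pass everything to the network for training, validation, or testing.

A single HDF5 file can hold a large number of images, and the data can then be loaded in batches. HDF5 provides facilities for managing, manipulating, compressing, and storing data. In this post we save cat and dog images to HDF5 and load them back.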

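Assigning images and labels
First, every image needs a label. Each cat image gets label = 0 and each dog image gets label = 1. The data should then be shuffled randomly so that the model weights are not biased toward whatever class happens to arrive at a particular stage of training. The dataset is split into 60% training, 20% validation, and 20% test. The example below is based on machinelearninguru.com.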

from random import shuffle
import glob
shuffle_data = True  # shuffle the addresses before saving
hdf5_path = 'Cat vs Dog/dataset.hdf5'  # address to where you want to save the hdf5 file
cat_dog_train_path = 'Cat vs Dog/train/*.jpg'

# read addresses and labels from the 'train' folder
addrs = glob.glob(cat_dog_train_path)
labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog

# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)
    
# Divide the data into 60% train, 20% validation, and 20% test
train_addrs = addrs[0:int(0.6*len(addrs))]
train_labels = labels[0:int(0.6*len(labels))]

val_addrs = addrs[int(0.6*len(addrs)):int(0.8*len(addrs))]
val_labels = labels[int(0.6*len(addrs)):int(0.8*len(addrs))]

test_addrs = addrs[int(0.8*len(addrs)):]
test_labels = labels[int(0.8*len(labels)):]
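
The shuffle-and-split step above can also be written as one small function. This is a minimal sketch, not the original author's code: the function name `split_dataset`, the `seed` parameter, and the file names are assumptions made for illustration, so that the split is reproducible and can be checked without the real image folder.

```python
import random


def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=None):
    """Shuffle items and split them into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # a seeded shuffle gives a repeatable split
    n_train = int(train_frac * len(items))
    n_val = int(val_frac * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


# Hypothetical (address, label) pairs standing in for the glob results.
pairs = [("cat.%d.jpg" % i, 0) for i in range(5)]
pairs += [("dog.%d.jpg" % i, 1) for i in range(5)]

train, val, test = split_dataset(pairs, seed=0)
print(len(train), len(val), len(test))
```

Keeping each address and its label together as a pair, as the `zip` call in the listing above does, ensures the labels stay aligned with the images after shuffling.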

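Creating the HDF5 file
Libraries such as h5py and PyTables can create files in the HDF5 format.

To store images, an array must be defined for each image dataset. Its shape is usually (number of images, image_height, image_width, image_depth). The element type is specified as a dtype when the array is created.

An HDF5 file can be created in two ways: through PyTables (the tables module), or directly through h5py.

With PyTables, create_earray creates an empty, extendable array to which data can be appended later. For the labels, create_array is more convenient because it writes the data immediately. To set the array dtype, use a PyTables atom type such as tables.UInt8Atom() for uint8. The first argument of both create_earray and create_array is the group in which the array is created. Groups let you organize data and are similar to folders inside the HDF5 file.

With h5py, arrays are created with create_dataset. In the form used here, the exact size must be decided when the array is defined. For labels, create_dataset can be given the data directly so that the labels are written right away. The dtype of an array can be set directly with a numpy type.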
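
The post only shows the PyTables route in code, so here is a minimal sketch of the h5py route described above. The file name, the small image count, and the synthetic pixel values are assumptions made for illustration, and the third-party h5py and numpy packages are assumed to be installed. h5py also supports resizable datasets through the `maxshape` argument, which is not shown here.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical sizes and file name, chosen only for illustration.
num_images, height, width, channels = 8, 224, 224, 3
path = os.path.join(tempfile.mkdtemp(), "dataset_h5py.hdf5")

with h5py.File(path, "w") as f:
    # This form of create_dataset fixes the full shape up front.
    images = f.create_dataset(
        "train_img", (num_images, height, width, channels), dtype=np.uint8)
    # Labels are written in one call by passing the data directly.
    labels = np.array([0, 1] * (num_images // 2), dtype=np.int8)
    f.create_dataset("train_labels", data=labels)
    # Fill the image dataset one slot at a time, as a loading loop would.
    for i in range(num_images):
        images[i, ...] = np.full((height, width, channels), i, dtype=np.uint8)

# Reopen the file read-only to confirm what was stored.
with h5py.File(path, "r") as f:
    stored_shape = f["train_img"].shape
    first_labels = f["train_labels"][:4].tolist()

print(stored_shape, first_labels)
```

In a real pipeline the `np.full(...)` line would be replaced by the decoded and resized image for index `i`.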

The code below creates the HDF5 file with PyTables.

import numpy as np
import tables

data_order = 'tf'  # 'th' for Theano, 'tf' for Tensorflow
img_dtype = tables.UInt8Atom()  # dtype in which the images will be saved

# check the order of data and choose the proper data shape to save images
if data_order == 'th':
    data_shape = (0, 3, 224, 224)
elif data_order == 'tf':
    data_shape = (0, 224, 224, 3)

# open a hdf5 file and create earrays
hdf5_file = tables.open_file(hdf5_path, mode='w')

train_storage = hdf5_file.create_earray(hdf5_file.root, 'train_img', img_dtype, shape=data_shape)
val_storage = hdf5_file.create_earray(hdf5_file.root, 'val_img', img_dtype, shape=data_shape)
test_storage = hdf5_file.create_earray(hdf5_file.root, 'test_img', img_dtype, shape=data_shape)

# the mean is accumulated as float32, so store it with a float atom rather than uint8
mean_storage = hdf5_file.create_earray(hdf5_file.root, 'train_mean', tables.Float32Atom(), shape=data_shape)

# create the label arrays and copy the labels data in them
hdf5_file.create_array(hdf5_file.root, 'train_labels', train_labels)
hdf5_file.create_array(hdf5_file.root, 'val_labels', val_labels)
hdf5_file.create_array(hdf5_file.root, 'test_labels', test_labels)

Now read the images one at a time, preprocess them, and save them.

import cv2  # OpenCV, used below to read, resize, and convert the images

# a numpy array to save the mean of the images
mean = np.zeros(data_shape[1:], np.float32)

# loop over train addresses
for i in range(len(train_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Train data: {}/{}'.format(i, len(train_addrs)))

    # read an image and resize to (224, 224)
    # cv2 load images as BGR, convert it to RGB
    addr = train_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # add any image pre-processing here

    # if the data order is Theano, axis orders should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)

    # save the image and calculate the mean so far
    train_storage.append(img[None])
    mean += img / float(len(train_labels))

# loop over validation addresses
for i in range(len(val_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Validation data: {}/{}'.format(i, len(val_addrs)))

    # read an image and resize to (224, 224)
    # cv2 load images as BGR, convert it to RGB
    addr = val_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # add any image pre-processing here

    # if the data order is Theano, axis orders should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)

    # save the image
    val_storage.append(img[None])

# loop over test addresses
for i in range(len(test_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Test data: {}/{}'.format(i, len(test_addrs)))

    # read an image and resize to (224, 224)
    # cv2 load images as BGR, convert it to RGB
    addr = test_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # add any image pre-processing here

    # if the data order is Theano, axis orders should change
    if data_order == 'th':
        img = np.rollaxis(img, 2)

    # save the image
    test_storage.append(img[None])

# save the mean and close the hdf5 file
mean_storage.append(mean[None])
hdf5_file.close()
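
Two details of the loops above are easy to miss: `np.rollaxis(img, 2)` moves the channel axis to the front for the Theano ordering, and `img[None]` adds a leading axis of length 1 so the image can be appended as one row. The sketch below illustrates both, along with the mean accumulation, using synthetic arrays in place of images read with OpenCV. The helper name `to_storage_layout` is an assumption made for illustration.

```python
import numpy as np


def to_storage_layout(img, data_order="tf"):
    """Return img with a leading batch axis, channels-first for 'th'."""
    if data_order == "th":
        img = np.rollaxis(img, 2)  # (H, W, C) -> (C, H, W)
    return img[None]  # add a batch axis of length 1


# Synthetic stand-ins for decoded 224x224 RGB images.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
          for _ in range(4)]

# Accumulate the mean image the same way the training loop does.
mean = np.zeros((224, 224, 3), np.float32)
for img in images:
    mean += img / float(len(images))

tf_batch = to_storage_layout(images[0], "tf")
th_batch = to_storage_layout(images[0], "th")
print(tf_batch.shape, th_batch.shape)
```

Since the three loops differ only in the address list and the target array, a helper like this could also be used to fold them into one function.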

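Reading the HDF5 file
We need to check that the data was stored correctly in the HDF5 file. To do so, load the data in batches of an arbitrary size and display the first image of each of the first five batches. Also check the label of each displayed image.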

import tables
import numpy as np

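hdf5_path = 'Cat vs Dog/dataset.hdf5'
subtract_mean = False
batch_size = 50  # example batch size; used by the batch loop below
nb_class = 2     # number of classes (cat, dog)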

# open the hdf5 file
hdf5_file = tables.open_file(hdf5_path, mode='r')

# subtract the training mean
if subtract_mean:
    mm = hdf5_file.root.train_mean[0]
    mm = mm[np.newaxis, ...]

# Total number of samples
data_num = hdf5_file.root.train_img.shape[0]

Now create the list of batches and shuffle its order. Then load all the images of each batch at once.

from random import shuffle
from math import ceil
import matplotlib.pyplot as plt

# create list of batches to shuffle the data
batches_list = list(range(int(ceil(float(data_num) / batch_size))))
shuffle(batches_list)

# loop over batches
for n, i in enumerate(batches_list):
    i_s = i * batch_size  # index of the first image in this batch
    i_e = min([(i + 1) * batch_size, data_num])  # one past the index of the last image in this batch

    # read batch images and remove training mean
    images = hdf5_file.root.train_img[i_s:i_e]
    if subtract_mean:
        # cast to float first; subtracting in uint8 would fail or wrap around
        images = images.astype(np.float32) - mm

    # read labels and convert to one hot encoding
    labels = np.asarray(hdf5_file.root.train_labels[i_s:i_e])
    # size the array by the actual batch length; the last batch may be shorter
    labels_one_hot = np.zeros((len(labels), nb_class))
    labels_one_hot[np.arange(len(labels)), labels] = 1

    print('{}/{}'.format(n + 1, len(batches_list)))

    print(labels[0], labels_one_hot[0, :])
    plt.imshow(images[0])
    plt.show()

    if n == 4:  # break after 5 batches
        break

hdf5_file.close()

With this code, the image dataset loaded in batches can be used to train and validate a deep learning model. Related code is available on the GitHub page provided by machinelearninguru.com.

To work with TFRecords, the TensorFlow record format, see this link.
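
The batch arithmetic and the one-hot conversion used in the reading loop can be checked on their own, without an HDF5 file. This is a minimal sketch; the function names `batch_ranges` and `one_hot` are assumptions made for illustration.

```python
from math import ceil

import numpy as np


def batch_ranges(data_num, batch_size):
    """Return (start, end) index pairs that cover data_num samples."""
    n_batches = int(ceil(data_num / float(batch_size)))
    return [(i * batch_size, min((i + 1) * batch_size, data_num))
            for i in range(n_batches)]


def one_hot(labels, nb_class):
    """Convert integer labels to a one-hot matrix sized to the batch."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((len(labels), nb_class), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1
    return encoded


ranges = batch_ranges(data_num=10, batch_size=4)
print(ranges)  # [(0, 4), (4, 8), (8, 10)]

encoded = one_hot([0, 1, 1], nb_class=2)
print(encoded)
```

The last range is shorter than `batch_size`, which is why the one-hot array is sized from the labels actually read rather than from `batch_size`.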

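Note - sensor specifications for deep learning data acquisition
The sensors used to acquire data for applications such as autonomous vehicles differ in what kinds of data they can capture and under what conditions, depending on the environment. The table below summarizes this.

Sensor specifications (Distributed Deep Learning with Hadoop and TensorFlow)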

Note - machine learning platforms

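Machine learning platforms can take several forms depending on which layers they use. Flux is a platform that covers data storage, training, simulation, and management. It uses ROS for simulation and for exchanging sensor data.

Note - types of data for deep learning models
To store big data and feed the required data to a deep learning model during training, the data format has to be structured. As described earlier, the data is divided into training, validation, and test sets, and each set is stored in batch units so that it can be loaded batch by batch. Each item is normalized into a structure the model can take as input before it is stored, and its label information must be stored with it.

Depending on how the data will be used, the following kinds of data can be acquired.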

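1. Image format
This format covers photos and videos taken with a camera as well as spatio-temporal image data such as signals. An image may consist of RGB pixels or of signal values expressed as integers or real numbers.
Images are divided into frames, and frames are represented in a format that can be streamed. Reading a stored video file requires a codec that can interpret the file format. Libraries such as OpenCV can usually read and write image and video data.
Signal-like data can be turned into an image in matrix form and stored that way.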
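
As one way to read the last point, the sketch below rescales a matrix of signal values to 8-bit pixel values so that it can be stored like any other image. The function name `matrix_to_uint8_image`, the linear min-max scaling, and the synthetic sine signal are all assumptions made for illustration; other normalizations may suit real sensor data better.

```python
import numpy as np


def matrix_to_uint8_image(values):
    """Linearly rescale a 2-D matrix to the 0..255 uint8 range."""
    values = np.asarray(values, dtype=np.float64)
    lo, hi = values.min(), values.max()
    if hi == lo:  # constant input: avoid division by zero
        return np.zeros(values.shape, dtype=np.uint8)
    scaled = (values - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# Hypothetical signal: 32 channels, each sampled at 64 time steps.
t = np.linspace(0.0, 1.0, 64)
signal = np.sin(2 * np.pi * np.outer(np.arange(1, 33), t))

image = matrix_to_uint8_image(signal)
print(image.shape, image.dtype)
```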

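2. Text format
Text is simpler than images in structure and format, and it is easy to read and write. A text format defines header information that identifies the kind and number of the training and validation datasets.

3. Point cloud geometry
A point cloud is a large dataset that can contain millions of points or more. Millions of points cannot be used directly as training data, so the cloud is segmented, the separated point sets are sampled appropriately, and the result is stored as a 3D grid. Structures such as voxels are sometimes used for this.

Other kinds include numeric data, vector data, and relational or topological information.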
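
The sampling and gridding step can be sketched as follows. This assumes a segmented cloud given as an (N, 3) array, uniform random down-sampling, and a binary occupancy grid. The function name `voxelize`, the grid size of 32, and the synthetic points are assumptions made for illustration, and the segmentation itself is not shown.

```python
import numpy as np


def voxelize(points, grid_size=32):
    """Map an (N, 3) point array to a binary occupancy grid."""
    points = np.asarray(points, dtype=np.float64)
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    extent[extent == 0] = 1.0  # guard against a flat axis
    # Scale each coordinate to [0, grid_size - 1] and truncate to a cell index.
    idx = ((points - mins) / extent * (grid_size - 1)).astype(np.int64)
    grid = np.zeros((grid_size,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1  # mark occupied cells
    return grid


rng = np.random.default_rng(0)
cloud = rng.random((5000, 3))  # synthetic stand-in for one segmented object
# Random down-sampling to a fixed number of points.
sample = cloud[rng.choice(len(cloud), size=1024, replace=False)]

grid = voxelize(sample, grid_size=32)
print(grid.shape, int(grid.sum()))
```

A fixed-size grid like this has the same shape for every object, so it can be stored in an HDF5 array in the same way as the images earlier in the post.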


Friday, September 21, 2018
