💭

Thinking, thinking, thinking, thinking

RDL Context

A lot of research has applied GNNs to tabular data.

However, most of it stops at a single table.

Work applying GNNs to a single table includes:

i. Performance ⇒ hypergraphs, feature-interaction discovery

ii. Representation learning ⇒ graph structure learning, SSL, contrastive learning

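Work that does this on RDB tables includes RelBench, RDBench, and 4DBInfer.

All of these build a heterogeneous graph with the Row2Node or Row2N/E technique: each row becomes a node (each table becomes a node type), and each relation (PK-FK) becomes an edge. With Row2N/E, the rows of some tables become edges instead of nodes.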
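
The Row2Node idea can be sketched in a few lines. This is a minimal illustration on plain Python dictionaries, not the implementation used by RelBench, RDBench, or 4DBInfer; the table names, column names, and the `row2node` helper are all hypothetical.

```python
from collections import defaultdict

# Toy relational database: table name -> list of row dicts.
# Table and column names are hypothetical.
tables = {
    "customer": [
        {"customer_id": 1, "age": 31},
        {"customer_id": 2, "age": 45},
    ],
    "order": [
        {"order_id": 10, "customer_id": 1, "amount": 20.0},
        {"order_id": 11, "customer_id": 1, "amount": 5.5},
        {"order_id": 12, "customer_id": 2, "amount": 99.0},
    ],
}
primary_keys = {"customer": "customer_id", "order": "order_id"}
# Each entry is (child table, FK column, parent table).
foreign_keys = [("order", "customer_id", "customer")]


def row2node(tables, primary_keys, foreign_keys):
    """Build a heterogeneous graph: one node per row, one edge per PK-FK link.

    Returns:
        nodes: node type -> {pk value -> dict of non-key feature columns}
        edges: (child type, fk column, parent type) -> list of (child pk, parent pk)
    """
    # Key columns identify rows and links, so they are not node features.
    key_columns = defaultdict(set)
    for table, pk in primary_keys.items():
        key_columns[table].add(pk)
    for child, fk_col, _parent in foreign_keys:
        key_columns[child].add(fk_col)

    nodes = {}
    for table, rows in tables.items():
        pk = primary_keys[table]
        nodes[table] = {
            row[pk]: {c: v for c, v in row.items() if c not in key_columns[table]}
            for row in rows
        }

    edges = defaultdict(list)
    for child, fk_col, parent in foreign_keys:
        child_pk = primary_keys[child]
        for row in tables[child]:
            parent_id = row[fk_col]
            # Skip null or dangling foreign keys.
            if parent_id is not None and parent_id in nodes[parent]:
                edges[(child, fk_col, parent)].append((row[child_pk], parent_id))
    return nodes, dict(edges)


nodes, edges = row2node(tables, primary_keys, foreign_keys)
```

Here each table contributes one node type and each foreign key contributes one edge type, which is why the resulting graph is heterogeneous.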

Other approaches include RelGNN, LLM prediction, foundation models, table→graph, and RDB-specific feature engineering.

The defects found in the existing RelBench, RDBench, and 4DBInfer are as follows.

Scalability ← can we really train on a graph built from billions of rows?

Noise/outliers/missing values in the RDB

Link prediction is suboptimal because of relation imbalance ← tables mostly have 1:N or N:N relations

Row interaction/similarity ← many methods examine the relations between columns within a single table, but no current multi-table work examines 1) intra-table and inter-table column similarity or 2) row similarity

There are very many kinds of downstream tasks, and the model has to be retrained for every task

Could we build a representation, or data, that is robust across all tasks?

Using LLMs ← could we predict better by exploiting the knowledge embedded in an LLM, or its ability to reason over complex entity relationships?

Dataset Distillation Context

Dataset distillation/condensation was first developed for the image domain.

The basic goal: build a compressed dataset from a large one, such that a model trained on the compressed dataset reaches similar performance.

Applications include:

NAS, transfer learning, continual learning, privacy sanitizing

The objectives are:

Model generalization ← the distilled dataset should generalize across most model architectures.

A model trained on the small dataset should perform similarly to a model trained on the full dataset.
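
One common way to write this goal down is as a bi-level optimization problem. The notation is an assumption introduced here, not taken from these notes: $\mathcal{T}$ is the full dataset, $\mathcal{S}$ the distilled dataset, $f_\theta$ the model, and $\ell$ the per-example loss.

```latex
\mathcal{L}_{\mathcal{D}}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \ell\big(f_{\theta}(x), y\big)

\mathcal{S}^{*} = \arg\min_{\mathcal{S}} \; \mathcal{L}_{\mathcal{T}}\big(\theta^{\mathcal{S}}\big)
\quad \text{s.t.} \quad
\theta^{\mathcal{S}} = \arg\min_{\theta} \; \mathcal{L}_{\mathcal{S}}(\theta),
\qquad |\mathcal{S}| \ll |\mathcal{T}|
```

The inner problem trains a model on the distilled data; the outer problem scores that model on the full data. The model-generalization objective asks that the resulting $\mathcal{S}^{*}$ also works for architectures other than the $f_\theta$ used during distillation.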

Techniques for this include:

Meta-matching 

The KRR (kernel ridge regression) method came as a follow-up

Gradient Matching

Distribution Matching
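
For reference, the surrogate objectives behind these families are often written roughly as below. These are the commonly cited forms, given as a sketch; the symbols are assumptions introduced here: $\mathcal{T}$ and $\mathcal{S}$ are the full and distilled datasets, $\mathcal{L}_{\mathcal{D}}$ the training loss on dataset $\mathcal{D}$, $D(\cdot,\cdot)$ a distance between gradients, $\psi_\vartheta$ a randomly initialized encoder, $K$ a kernel matrix, and $\lambda$ a ridge regularizer. Meta-matching optimizes the bi-level goal directly by differentiating through training, so it has no separate surrogate here.

```latex
% KRR-based: the inner training problem is solved in closed form
\min_{X_s, y_s} \; \frac{1}{2} \left\| y_t - K_{X_t X_s} \big( K_{X_s X_s} + \lambda I \big)^{-1} y_s \right\|_2^{2}

% Gradient matching: align gradients on S and T over R training steps
\min_{\mathcal{S}} \; \mathbb{E}_{\theta_0 \sim P_{\theta_0}}
\left[ \sum_{t=0}^{R-1} D\big( \nabla_{\theta} \mathcal{L}_{\mathcal{S}}(\theta_t), \; \nabla_{\theta} \mathcal{L}_{\mathcal{T}}(\theta_t) \big) \right]

% Distribution matching: align mean embeddings under random encoders
\min_{\mathcal{S}} \; \mathbb{E}_{\vartheta \sim P_{\vartheta}}
\left\| \frac{1}{|\mathcal{T}|} \sum_{x \in \mathcal{T}} \psi_{\vartheta}(x)
      - \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \psi_{\vartheta}(s) \right\|^{2}
```

Gradient matching still needs gradients on the full data at every step, while distribution matching avoids training the encoder at all, which is why it is usually the cheapest of the three.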

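These methods share the following basic limitations:

scalability, learning time

In addition, several models apply these ideas to the graph domain.

However, they have the same problems.

For tabular distillation, there is neither a clearly defined problem nor much related work.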

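New properties of data distillation on tabular data

The heterogeneous nature of the data

The ability to understand multi-modality

That said, my own impression is that the lack of work may simply be because methods such as PCA/AE already exist and ML techniques built on them already work well.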

RDB + Data Distillation

Table ⇒ Graph ⇒ Task(Model)

Whether to focus on the table or on the graph

Ideation Framework

• What is the real defect, or the potential to be unleashed, in RDL?

• Can graph condensation or tabular distillation address that defect/potential?

• What is the novelty?

RDL View

• What is the real defect in RDL?

Scalability

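4DBInfer does use datasets with billions of rows, but the tasks actually run on them use only a very small part of the data.

It would be worth testing how long a task with many relations attached actually takes (currently the largest target table has on the order of a million rows).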

Noise/Outlier/Missing value or Relation Imbalance

Need to investigate, based on 4DBInfer, whether these actually occur often.
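
A first pass at that investigation could simply profile each table and each PK-FK relation. The sketch below assumes tables are loaded as lists of row dictionaries with `None` for missing values; the helper names and the toy data are hypothetical and do not come from 4DBInfer.

```python
from collections import Counter


def missing_rate(rows, column):
    """Fraction of rows whose value in `column` is missing (None)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def fanout_stats(parent_rows, parent_pk, child_rows, fk_col):
    """Summarise how many child rows reference each parent row via fk_col."""
    counts = Counter(
        row[fk_col] for row in child_rows if row.get(fk_col) is not None
    )
    fanouts = sorted(counts.get(parent[parent_pk], 0) for parent in parent_rows)
    if not fanouts:
        return None
    return {
        "min": fanouts[0],
        "median": fanouts[len(fanouts) // 2],  # upper median for even counts
        "max": fanouts[-1],
        "childless_parents": sum(1 for f in fanouts if f == 0),
    }


# Hypothetical toy tables, not taken from 4DBInfer.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 1},
    {"order_id": 13, "customer_id": 2},
    {"order_id": 14, "customer_id": None},  # missing foreign key
]

stats = fanout_stats(customers, "customer_id", orders, "customer_id")
null_fk_rate = missing_rate(orders, "customer_id")
```

A large gap between the median and maximum fan-out, or many childless parents, would be one concrete signal of relation imbalance; a high null rate on a foreign-key column signals missing links rather than missing features.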

Possible solutions

i. Develop a graph condensation method whose objective addresses relation imbalance

ii. Give graph condensation the objective of producing positive/negative samples → representation learning via graph contrastive learning

iii. A paper on dataset distillation on tabular data for outlier detection does exist
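
To make option ii concrete: once positive and negative samples are available, a contrastive objective such as InfoNCE could be applied to node embeddings. The sketch below is a generic InfoNCE loss for a single anchor, using dot-product similarity; how the positives and negatives would be produced by condensation is left open in these notes, and the function name and toy vectors are hypothetical.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor embedding.

    Similarity is the dot product, so embeddings are assumed to be
    L2-normalised by the caller. Lower loss means the anchor is closer
    to its positive than to its negatives.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))

    # Negative log-probability of picking the positive among all candidates.
    return log_sum - logits[0]


anchor = [1.0, 0.0]
loss_aligned = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_misaligned = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The loss is small when the positive is the most similar candidate and grows when a negative is more similar than the positive, which is the behaviour a condensation objective built on positive/negative sampling would rely on.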

row interaction/similarity 

Experiment on RelBench: use LLM-based embeddings to connect edges between nodes of the same type, then evaluate..?
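
One way to set up that experiment is to add k-nearest-neighbour edges between rows of the same table based on embedding similarity. The sketch below assumes the row embeddings are already computed (for example by an LLM encoder, which is not shown); the function names, the `threshold` parameter, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_similarity_edges(embeddings, k=1, threshold=0.0):
    """Connect each row to its top-k most similar rows of the same table.

    embeddings: row id -> embedding vector (all rows from one table).
    Returns directed (src, dst, similarity) edges whose similarity is
    at least `threshold`.
    """
    edges = []
    for src, u in embeddings.items():
        scored = [
            (cosine(u, v), dst) for dst, v in embeddings.items() if dst != src
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for sim, dst in scored[:k]:
            if sim >= threshold:
                edges.append((src, dst, sim))
    return edges


# Stand-in vectors; real ones would come from an embedding model.
row_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
similarity_edges = knn_similarity_edges(row_embeddings, k=1, threshold=0.5)
```

This brute-force version compares every pair of rows, so it is quadratic in the table size; at the scales discussed under Scalability an approximate nearest-neighbour index would be needed instead.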

task agnostic representation learning / graph condensation

i. There is a paper reporting that only 1% of the whole graph is used for a given task

ii. That approach augments the graph in a rule-based way

iii. It is not task-agnostic; it applies only to the node classification task

using LLM

Need to look for more related work here.

Distillation View

• Table ⇒ Graph ⇒ Model

◦ Which part should we modify?