Tabular Dataset Distillation for GNN

Field

GNN in prod

NAS + RelBench + Recommendation ⇒ e.g., can we generate a high-quality fake dataset that guarantees privacy for the many features stored in a database/dataset?

Put simply: multi-table graph condensation + synthesis

Tabular Distillation for GNN

Graph Condensation

Graph Scaling

Tabular Learning

Tabular GNN

Relational Deep Learning

- Table matching problem

⇒ How to build good embeddings from a relational dataset

→ Paper that uses LLMs

→ New message passing, but simple

→ Feels like a development framework

→ Feels like a foundation model; need to read it

Graph Synthesis

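Things to check

• Scaling GNN to Larger Tabular Data
• Whether large-scale data actually exists
• Whether graph condensation, graph distillation, and tabular distillation exist as established concepts, and what their purpose is
• I want to use distillation to build a graph that is well suited to SSL; is that a valid approach?
• What the problems with large-scale tabular data are
• How the GNN should be designed so that it also works on unseen rows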
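For the unseen-row question, one candidate direction is inductive neighborhood aggregation in the style of GraphSAGE (the family PinSage belongs to): the layer depends only on a row's own features and the features of the rows it links to, so it can be applied to rows that were not in the training graph. The sketch below is a minimal illustration under assumptions, not a proposed design: the function name `sage_mean_layer`, the random stand-in weights, and the idea that links come from foreign keys are all hypothetical.

```python
import numpy as np

def sage_mean_layer(x_self, x_neighbors, w_self, w_neigh):
    """One GraphSAGE-style mean-aggregation layer for a single node.

    x_self:      (d,) features of the target row
    x_neighbors: (k, d) features of rows linked to it (assumed: via foreign keys)
    w_self, w_neigh: (d, h) weight matrices learned on the training graph
    """
    if len(x_neighbors) == 0:
        neigh_mean = np.zeros_like(x_self)  # isolated row: no neighbor signal
    else:
        neigh_mean = x_neighbors.mean(axis=0)
    h = x_self @ w_self + neigh_mean @ w_neigh
    return np.maximum(h, 0.0)  # ReLU

rng = np.random.default_rng(0)
d, h = 4, 3
w_self = rng.normal(size=(d, h))   # stand-ins for trained weights
w_neigh = rng.normal(size=(d, h))

# An unseen row arrives with its own features and links to two existing rows.
new_row = rng.normal(size=d)
linked_rows = rng.normal(size=(2, d))
embedding = sage_mean_layer(new_row, linked_rows, w_self, w_neigh)
```

Nothing in the layer is indexed by node ID, which is what makes it usable on new rows; a transductive model with a per-node embedding table would not have this property.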

Dataset

• CTU Relational
• 4DBInfer data
• RelBench data
• graph
  ◦ Big Graph Data Sets
  ◦ Stanford Large Network Dataset Collection

Todo

• Thursday: a few papers to read urgently
  ◦ Follow-up papers to Relational Deep Learning
  ◦ Whether graph condensation, graph distillation, and tabular distillation exist as established concepts, and what their purpose is
  ◦ How the GNN should be designed so that it also works on unseen rows
• Tuesday, Wednesday: read the papers below
• Thursday
  ◦ Understand data condensation
    ▪ dataset distillation
    ▪ gradient matching
    ▪ distribution matching
  ◦ Study how this relates to tabular datasets
    ▪ New Properties of the Data Distillation Method When Working With Tabular Data
  ◦ Read papers that apply LLMs to RDL
    ▪ Tackling prediction tasks in relational databases with LLMs
    ▪ Large Scale Transfer Learning for Tabular Data via Language Modeling
  ◦ By this point nearly everything has been read, so lay it all out and think it through

Friday

Points to think about

Scalability

• In RelBench, scalability did not surface directly as a problem.
• It only goes as far as saying that we should think about RDBs with billions of rows.
• Check how large web-scale graphs such as PinSage's actually are, and whether the scalability problem can be addressed.

Scalability on Graph

Defect on RelBench

There are roughly three defects; can they be framed and solved as a Data Regeneration / Training Data Development problem?

• Noise/Outlier/Missing Value on RDB
• Node/Relation Imbalance on RDB ← because an RDB is basically a heterogeneous graph, and each table can have its own data imbalance
• Considering Row Interaction/Similarity
  ◦ RelBench currently focuses on n-partite graphs.
  ◦ We can reconstruct the graph using a table's columns.
• Robustness ← because of data imbalance, it may be worthwhile to generate good positive/negative samples for contrastive learning, since some research already applies contrastive learning to GNN-based tabular learning
• Pre-trained Relational Deep Representation Learning
  ◦ Downstream-task-agnostic pre-trained deep representation learning
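The row interaction/similarity item above (reconstructing a graph from a table's columns) can be sketched as a k-nearest-neighbor graph over rows. This is only one possible construction, assuming all columns are numeric; the function name `knn_row_edges`, the toy two-column table, and the choice of standardized Euclidean distance are illustrative assumptions, not something RelBench provides.

```python
import numpy as np

def knn_row_edges(features, k):
    """Return directed edges (i, j) linking each row i to its k most similar rows j."""
    x = np.asarray(features, dtype=float)
    # Standardize columns so no single column dominates the distance.
    std = x.std(axis=0)
    std[std == 0] = 1.0
    x = (x - x.mean(axis=0)) / std
    # Pairwise squared Euclidean distances between rows.
    diff = x[:, None, :] - x[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]
    src = np.repeat(np.arange(len(x)), k)
    return np.stack([src, nbrs.ravel()], axis=0)  # shape (2, n * k)

# Toy table with hypothetical columns (age, income).
table = np.array([
    [25, 30_000],
    [27, 32_000],
    [52, 90_000],
    [55, 95_000],
])
edges = knn_row_edges(table, k=1)
```

These row-to-row edges would sit alongside the foreign-key edges of the n-partite graph. The dense pairwise distance matrix is quadratic in the number of rows, so at RDB scale an approximate nearest-neighbor index would be needed instead.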

Application of Graph Condensation (on RelBench)

• What effect is graph condensation usually expected to have?
  ◦ Adapts well to downstream tasks
  ◦ Applies well to scalable (large-scale) datasets…?
• Does graph condensation have no further significance beyond that?
  ◦ Is there nothing that, like dataset distillation, refines a very noisy dataset?
    ▪ Something along the lines of Dataset Regeneration For Sequential Recommendation
• What are the conditions for this graph condensation…?
  ◦ Is it focused entirely on the downstream task?
  ◦ Join Optimization: what is this?
• What graph condensation conditions does RelBench require?
  ◦ We do not know in advance which task will be run.
  ◦ It must also work well on time-variant datasets. → Graph condensation is reportedly also used for continual learning, so there may be a connection.
  ◦ It must be a graph condensation method that can handle heterogeneous node features.
    ▪ Does any graph condensation method exist for graphs whose node attributes are multimodal and heterophilic..?
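As a reference point for the condensation questions above, here is a minimal sketch in the spirit of distribution matching (one of the Todo items): a few synthetic rows are optimized so that their mean embedding under freshly sampled random ReLU features matches that of the real rows. It is heavily simplified and rests on assumptions: numeric features only, a single class, no graph structure, and a hand-written gradient. The function name `condense_by_distribution_matching` and all hyperparameters are hypothetical.

```python
import numpy as np

def condense_by_distribution_matching(real, n_syn, steps=300, lr=0.5, width=64, seed=0):
    """Learn n_syn synthetic rows whose mean random-ReLU embedding matches that of `real`."""
    rng = np.random.default_rng(seed)
    n, d = real.shape
    syn = rng.normal(size=(n_syn, d))            # random initialization
    for _ in range(steps):
        # Fresh random embedding phi(x) = relu(x @ w) at every step.
        w = rng.normal(size=(d, width)) / np.sqrt(width)
        real_emb = np.maximum(real @ w, 0.0).mean(axis=0)
        pre = syn @ w
        syn_emb = np.maximum(pre, 0.0).mean(axis=0)
        diff = real_emb - syn_emb                # shape (width,)
        # Gradient of ||diff||^2 with respect to each synthetic row.
        grad = -(2.0 / n_syn) * ((pre > 0) * diff) @ w.T
        syn -= lr * grad
    return syn

rng = np.random.default_rng(1)
real_rows = rng.normal(loc=[3.0, -2.0, 1.0], scale=1.0, size=(200, 3))
synthetic_rows = condense_by_distribution_matching(real_rows, n_syn=5)
```

Here 200 rows are condensed into 5. The objective never looks at a downstream label, which relates to the "we do not know which task will be run" condition, but it says nothing yet about heterogeneous or multimodal node features.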

Think, think, think, think