Scalability on Graph

This page surveys the sizes of the datasets used in research on large-scale graphs.

GraphFM

Uses the feature momentum technique to mitigate neighborhood explosion on graphs.
These are large-scale graph datasets; ogbn-products in particular is a very large graph with more than 2.44 million nodes and more than 60 million edges.
| Dataset | # of Nodes | # of Edges | Avg. Degree | # of Features | # of Classes | Train/Val/Test Split |
| --- | --- | --- | --- | --- | --- | --- |
| Flickr | 89,250 | 899,756 | 10.08 | 500 | 7 (single-label) | 50% / 25% / 25% |
| Yelp | 716,847 | 6,997,410 | 9.76 | 300 | 50 (multi-label) | 75% / 15% / 10% |
| Reddit | 232,965 | 11,606,919 | 49.82 | 602 | 41 (single-label) | 66% / 10% / 24% |
| ogbn-arxiv | 169,343 | 1,166,243 | 6.89 | 128 | 40 (single-label) | 53.7% / 17.6% / 28.7% |
| ogbn-products | 2,449,029 | 61,859,140 | 25.26 | 100 | 47 (single-label) | 10% / 2% / 88% |
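
The Avg. Degree values in the GraphFM listing above can be reproduced as the number of edges divided by the number of nodes. A minimal sketch, with the counts copied from that listing; the names `DATASETS` and `avg_degree` are illustrative and do not come from the paper:

```python
# Node and edge counts copied from the GraphFM listing above.
DATASETS = {
    "Flickr": (89_250, 899_756),
    "Yelp": (716_847, 6_997_410),
    "Reddit": (232_965, 11_606_919),
    "ogbn-arxiv": (169_343, 1_166_243),
    "ogbn-products": (2_449_029, 61_859_140),
}


def avg_degree(num_nodes: int, num_edges: int) -> float:
    """Average degree as listed above: edges per node."""
    return num_edges / num_nodes


for name, (nodes, edges) in DATASETS.items():
    print(f"{name}: {avg_degree(nodes, edges):.2f}")
```

Rounded to two decimals, each ratio matches the listed Avg. Degree, which suggests the column is simply edges per node.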

GNNAutoScale

Uses node embeddings computed in earlier training steps to mitigate neighborhood explosion.
The datasets below include both large-scale and small-scale graph datasets.
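| Category | Dataset | Nodes | Edges | Features | Classes | Task | Label Rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Small-Scale | CORA | 2,708 | 5,278 | 1,433 | 7 | Multi-Class | 5.17% |
| Small-Scale | CiteSeer | 3,327 | 4,552 | 3,703 | 6 | Multi-Class | 3.61% |
| Small-Scale | PubMed | 19,717 | 44,324 | 500 | 3 | Multi-Class | 0.30% |
| Small-Scale | Coauthor-CS | 18,333 | 81,894 | 6,805 | 15 | Multi-Class | 1.64% |
| Small-Scale | Coauthor-Physics | 34,493 | 247,962 | 8,415 | 5 | Multi-Class | 0.29% |
| Small-Scale | Amazon-Computer | 13,752 | 245,861 | 767 | 10 | Multi-Class | 1.45% |
| Small-Scale | Amazon-Photo | 7,650 | 119,081 | 745 | 8 | Multi-Class | 2.09% |
| Small-Scale | Wiki-CS | 11,701 | 215,863 | 300 | 10 | Multi-Class | 4.96% |
| Large-Scale | CLUSTER | 1,406,436 | 25,810,340 | 6 | 6 | Multi-Class | 83.35% |
| Large-Scale | Reddit | 232,965 | 11,606,919 | 602 | 41 | Multi-Class | 65.86% |
| Large-Scale | PPI | 56,944 | 793,632 | 50 | 121 | Multi-Label | 78.86% |
| Large-Scale | Flickr | 89,250 | 449,878 | 500 | 7 | Multi-Class | 50.00% |
| Large-Scale | Yelp | 716,847 | 6,977,409 | 300 | 100 | Multi-Label | 75.00% |
| Large-Scale | ogbn-arxiv | 169,343 | 1,157,799 | 128 | 40 | Multi-Class | 53.70% |
| Large-Scale | ogbn-products | 2,449,029 | 61,859,076 | 100 | 47 | Multi-Class | 8.03% |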
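
To see why neighborhood explosion matters at these scales, a back-of-the-envelope estimate helps: with average degree d, an L-layer GNN can touch on the order of d^L nodes per target node, capped by the size of the graph. The sketch below uses the Reddit counts listed above and assumes every node has the average degree, which real graphs do not; `receptive_field_bound` is an illustrative name, not part of GNNAutoScale.

```python
def receptive_field_bound(num_nodes: int, num_edges: int, num_layers: int) -> int:
    """Rough estimate of how many nodes one target node can reach.

    Assumes every node has the average degree (edges / nodes), so the
    L-hop neighborhood grows like degree ** L until it covers the graph.
    """
    degree = num_edges / num_nodes
    return min(num_nodes, round(degree ** num_layers))


# Reddit counts from the listing above.
for layers in (1, 2, 3, 4):
    print(layers, receptive_field_bound(232_965, 11_606_919, layers))
```

Under this simplification the estimate reaches the whole Reddit graph by four layers, which is the growth that reusing previously computed embeddings is meant to avoid.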

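Scaling Graph Neural Networks with Approximate PageRank

| Dataset | Nodes | Edges | Features | Labels | Label Type | Characteristics and Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| Reddit | 233K | 11.6M | 602 | 41 | Single-Label (Multi-Class) | One of the largest graphs commonly used in prior work. |
| Amazon2M | 2.5M | 61M | 100 | 47 | Single-Label (Multi-Class) | Many nodes, but the node feature size is small. |
| Twitter Geo-Location | N/A | N/A | N/A | N/A | N/A | Unsuitable graph structure (most nodes have only a self-loop and no other edges). |
| Cora-Full | 18.7K | 62.4K | 8.7K | N/A | Single-Label (Multi-Class) | A relatively small academic graph, frequently used as a benchmark dataset. |
| PubMed | 19.7K | 44.3K | 0.5K | N/A | Single-Label (Multi-Class) | Frequently used as a benchmark dataset; the node feature size is small. |
| MAG-Scholar-C | 10.5M | 133M | 2.8M | 8 | Coarse-Grained (Multi-Class) | Large-scale academic graph labeled with 8 top-level fields of study. A newly introduced large-scale benchmark dataset based on the MAG dataset. |
| MAG-Scholar-F | 12.4M | 173M | 2.8M | 253 | Fine-Grained (Multi-Class) | Labeled with 253 fine-grained fields of study (architecture, geology, linguistics, etc.). |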

FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data

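High-order feature interactions capture correlations between different columns in tabular data.
Existing methods are either search-based or DNN-based, so they either incur a high search cost or lack interpretability.
FIVES defines a feature graph and applies a GNN to it.
The datasets are listed below (Criteo has tens of millions of instances!).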
| Dataset | #Features | #Train Instances | #Test Instances |
| --- | --- | --- | --- |
| Employee | 9 | 29,493 | 3,278 |
| Bank | 20 | 27,459 | 13,729 |
| Adult | 42 | 32,561 | 16,281 |
| Credit | 16 | 100,000 | 50,000 |
| Criteo | 39 | 41.2M | 4.58M |
| Business1 | 53 | 1.57M | 0.67M |
| Business2 | 59 | 25.08M | 12.53M |

GFS: Graph-based Feature Synthesis for Prediction over Relational Databases

Converts the relational database (RDB) into a graph and combines it with strong single-table models (DeepFM, FT-Transformer, etc.) to improve prediction performance.
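| Dataset | Train Set | Validation Set | Test Set | Tables | Total Rows | Foreign Keys | Total Columns |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AVS (Acquire-valued-shoppers) | 96,035 | 32,011 | 32,011 | 3 | 8.2M | 2 | 23 |
| Outbrain | 828,251 | 21,796 | 21,796 | 9 | 28M | 10 | 23 |
| Diginetica | 882,415 | 20,521 | 20,521 | 6 | 2.7M | 7 | 20 |
| KDD15 | 72,325 | 24,108 | 24,108 | 4 | 8.3M | 3 | 19 |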

PinSage → 3 billion nodes, 18 billion edges

Supervised Learning on Relational Databases with Graph Neural Networks

| Dataset | Train Datapoints | Tables/Node Types | Foreign Keys/Edge Types | Feature Types | URL |
| --- | --- | --- | --- | --- | --- |
| Acquire Valued Shoppers Challenge | 160,057 | 7 | 10 | Categorical, Scalar, Datetime | Kaggle Link |
| Home Credit Default Risk | 307,511 | 7 | 9 | Categorical, Scalar | Kaggle Link |
| KDD Cup 2014 | 619,326 | 4 | 4 | Categorical, Geospatial, Scalar, Textual, Datetime | Kaggle Link |
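
The column headers above pair tables with node types and foreign keys with edge types. A minimal sketch of that mapping, assuming each row becomes a node and each foreign-key reference becomes an edge; the toy `customers` and `orders` tables and the helper `rdb_to_graph` are hypothetical and are not taken from the paper:

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Node = Tuple[str, object]  # (table name, primary key)


def rdb_to_graph(
    tables: Dict[str, List[Row]],
    foreign_keys: List[Tuple[str, str, str]],
) -> Tuple[List[Node], List[Tuple[Node, Node, str]]]:
    """Turn rows into typed nodes and foreign-key references into typed edges.

    foreign_keys holds (child_table, fk_column, parent_table) triples.
    Assumption: every table has a primary-key column named "id".
    """
    nodes = [(name, row["id"]) for name, rows in tables.items() for row in rows]
    edges = []
    for child, column, parent in foreign_keys:
        for row in tables[child]:
            if row[column] is not None:  # skip rows without a reference
                edge_type = f"{child}.{column}->{parent}"
                edges.append(((child, row["id"]), (parent, row[column]), edge_type))
    return nodes, edges


# Hypothetical two-table database with one foreign key.
tables = {
    "customers": [{"id": 1}, {"id": 2}],
    "orders": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 1},
        {"id": 12, "customer_id": 2},
    ],
}
nodes, edges = rdb_to_graph(tables, [("orders", "customer_id", "customers")])
print(len(nodes), "nodes,", len(edges), "edges")
```

In this toy case two tables give two node types and one foreign key gives one edge type, which is the same way the listing counts them.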

TabGNN

The JD dataset appears to be large.
| Task | Dataset | #Samples (Train) | #Samples (Test) | #Features (#Num / #Cat) | Validation Ratio | Domain | Temporal Constraint |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Classification | Data1 | 35,581 | 8,895 | 16 / 17 | 15% | Loan | Yes |
| Classification | Data2 | 1,888,366 | 1,119,778 | 8 / 23 | 5% | News | Yes |
| Classification | Data3 | 108,801 | 27,201 | 19 / 9 | 15% | Loan | No |
| Classification | Data4 | 226,091 | 34,867 | 14 / 26 | 10% | E-commerce | Yes |
| Classification | Data5 | 435,329 | 31,076 | 8 / 34 | 10% | E-commerce | Yes |
| Regression | Data6 | 1,638,193 | 702,016 | 43 / 16 | 10% | Live streaming | Yes |
| Regression | Data7 | 3,923,406 | 694,194 | 0 / 25 | 5% | Retail | Yes |
| Regression | Data8 | 10,512,133 | 29,879 | 4 / 17 | 5% | Retail | Yes |
| Regression | Data9 | 179,893 | 43,236 | 5 / 2 | 15% | Government | Yes |
| Classification | Home Credit | 307,511 | 48,744 | 175 / 51 | 10% | Loan | Yes |
| Classification | JD | 4,992,910 | 446,763 | 6 / 17 | 5% | E-commerce | Yes |

RelBench

Prediction tasks and statistics of the RelBench datasets

| Dataset | Task Name | Task Type | #Rows of Training Table (Train / Validation / Test) | #Unique Entities | %Train/Test Entity Overlap | #Dst Entities |
| --- | --- | --- | --- | --- | --- | --- |
| rel-amazon | user-churn | entity-cls | 4,732,555 / 409,792 / 351,885 | 1,585,983 | 88.0% | - |
| rel-amazon | item-churn | entity-cls | 2,559,264 / 177,689 / 166,842 | 416,352 | 93.1% | - |
| rel-amazon | user-ltv | entity-reg | 4,732,555 / 409,792 / 351,885 | 1,585,983 | 88.0% | - |
| rel-amazon | item-ltv | entity-reg | 2,707,679 / 166,978 / 178,334 | 427,537 | 93.5% | - |
| rel-amazon | user-item-purchase | recommendation | 5,112,803 / 351,876 / 393,985 | 1,632,909 | 87.4% | 12,562,384 |
| rel-amazon | user-item-rate | recommendation | 3,667,157 / 257,939 / 292,609 | 1,481,360 | 81.0% | 7,665,611 |
| rel-amazon | user-item-review | recommendation | 2,324,177 / 116,970 / 127,021 | 894,136 | 74.1% | 5,406,835 |
| rel-avito | ad-ctr | entity-reg | 5,100 / 1,766 / 1,816 | 4,997 | 59.8% | - |
| rel-avito | user-clicks | entity-cls | 59,454 / 21,183 / 47,996 | 66,449 | 45.3% | - |
| rel-avito | user-visits | entity-cls | 86,619 / 29,979 / 36,129 | 63,405 | 64.6% | - |
| rel-avito | user-ad-visit | recommendation | 86,616 / 29,979 / 36,129 | 63,402 | 64.6% | 3,616,174 |
| rel-event | user-attendance | entity-reg | 19,261 / 2,014 / 2,006 | 9,694 | 14.6% | - |
| rel-event | user-repeat | entity-cls | 3,842 / 268 / 246 | 1,514 | 11.5% | - |
| rel-event | user-ignore | entity-cls | 19,239 / 4,185 / 4,010 | 9,799 | 21.1% | - |
| rel-f1 | driver-dnf | entity-cls | 11,411 / 566 / 702 | 821 | 50.0% | - |
| rel-f1 | driver-top3 | entity-cls | 1,353 / 588 / 726 | 134 | 50.0% | - |
| rel-f1 | driver-position | entity-reg | 7,453 / 499 / 760 | 826 | 44.6% | - |
| rel-hm | user-churn | entity-cls | 3,871,410 / 76,556 / 74,575 | 1,002,984 | 89.7% | - |
| rel-hm | item-sales | entity-reg | 5,488,184 / 105,542 / 105,542 | 105,542 | 100.0% | - |
| rel-hm | user-item-purchase | recommendation | 3,878,451 / 74,575 / 67,144 | 1,004,046 | 89.2% | 13,428,473 |
| rel-stack | user-engagement | entity-cls | 1,360,850 / 85,838 / 88,137 | 88,137 | 97.4% | - |
| rel-stack | user-badge | entity-cls | 3,386,276 / 247,398 / 255,360 | 255,360 | 96.9% | - |
| rel-stack | post-votes | entity-reg | 2,453,921 / 156,216 / 160,903 | 160,903 | 97.1% | - |
| rel-stack | user-post-comment | recommendation | 21,239 / 825 / 758 | 11,453 | 59.9% | 44,940 |
| rel-stack | post-post-related | recommendation | 5,855 / 226 / 258 | 5,524 | 8.5% | 7,456 |
| rel-trial | study-outcome | entity-cls | 11,994 / 960 / 825 | 17,379 | 0.0% | - |
| rel-trial | study-adverse | entity-reg | 43,335 / 3,596 / 3,098 | 30,092 | 50.0% | - |
| rel-trial | site-success | entity-reg | 151,407 / 19,740 / 22,617 | 129,542 | 42.0% | - |
| rel-trial | condition-sponsor-run | recommendation | 36,934 / 3,081 / 2,057 | 3,956 | 98.4% | 533,624 |
| rel-trial | site-sponsor-run | recommendation | 669,310 / 37,003 / 27,428 | 445,513 | 48.3% | 1,565,463 |

OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs

| Task Type | Dataset | Statistics |
| --- | --- | --- |
| Node-level | MAG240M | #Nodes: 244,160,499; #Edges: 1,728,364,232 |
| Link-level | WikiKG90M† | #Nodes: 87,143,637; #Edges: 504,220,369 |
| Graph-level | PCQM4M† | #Graphs: 3,803,453; #Edges (Total): 55,399,880 |
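
To put these edge counts in perspective, the sketch below estimates the memory needed just to hold each edge list. It assumes a COO layout with two 64-bit integer node IDs per edge, which is an assumption made for illustration and not necessarily how OGB-LSC stores the data; relation types, features, and labels are ignored, and `edge_list_gib` is an illustrative name.

```python
BYTES_PER_ID = 8  # assumption: 64-bit integer node IDs

# Edge counts copied from the OGB-LSC listing above.
EDGE_COUNTS = {
    "MAG240M": 1_728_364_232,
    "WikiKG90M": 504_220_369,
    "PCQM4M": 55_399_880,
}


def edge_list_gib(num_edges: int) -> float:
    """Size in GiB of a COO edge list: two node IDs per edge."""
    return num_edges * 2 * BYTES_PER_ID / 2**30


for name, edges in EDGE_COUNTS.items():
    print(f"{name}: {edge_list_gib(edges):.2f} GiB")
```

Under these assumptions the MAG240M edge list alone takes roughly 26 GiB before any node features are loaded.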

Opinion