This note surveys the sizes of the datasets used in research on large-scale graphs.
GraphFM
• Uses a feature-momentum technique to mitigate neighborhood explosion in graphs.
• The datasets below are large-scale graphs; ogbn-products in particular is very large, with more than 2.44 million nodes and more than 60 million edges.
Dataset | # of Nodes | # of Edges | Avg. Degree | # of Features | # of Classes | Train/Val/Test Split |
Flickr | 89,250 | 899,756 | 10.08 | 500 | 7 (single-label) | 50% / 25% / 25% |
Yelp | 716,847 | 6,997,410 | 9.76 | 300 | 50 (multi-label) | 75% / 15% / 10% |
Reddit | 232,965 | 11,606,919 | 49.82 | 602 | 41 (single-label) | 66% / 10% / 24% |
ogbn-arxiv | 169,343 | 1,166,243 | 6.89 | 128 | 40 (single-label) | 53.7% / 17.6% / 28.7% |
ogbn-products | 2,449,029 | 61,859,140 | 25.26 | 100 | 47 (single-label) | 10% / 2% / 88% |
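The Avg. Degree column above appears to be the edge count divided by the node count. The short check below recomputes it from the values copied out of the table; the convention (edges / nodes) and the helper name `avg_degree` are assumptions made for illustration, not something the GraphFM paper is quoted on here.

```python
# (nodes, edges, reported average degree) copied from the table above.
datasets = {
    "Flickr": (89_250, 899_756, 10.08),
    "Yelp": (716_847, 6_997_410, 9.76),
    "Reddit": (232_965, 11_606_919, 49.82),
    "ogbn-arxiv": (169_343, 1_166_243, 6.89),
    "ogbn-products": (2_449_029, 61_859_140, 25.26),
}


def avg_degree(num_nodes: int, num_edges: int) -> float:
    """Average degree as edges / nodes (assumed convention)."""
    return num_edges / num_nodes


for name, (nodes, edges, reported) in datasets.items():
    computed = round(avg_degree(nodes, edges), 2)
    print(f"{name:14s} computed={computed:6.2f} reported={reported:6.2f}")
```

Under this convention the recomputed values agree with the reported column to two decimals, which is a cheap sanity check when copying dataset statistics between papers.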
GNNAutoScale
• Mitigates neighborhood explosion by reusing node embeddings computed in earlier training steps.
• The datasets below include both large-scale and small-scale graphs.
Category | Dataset | Nodes | Edges | Features | Classes | Task | Label Rate |
Small-Scale | CORA | 2,708 | 5,278 | 1,433 | 7 | Multi-Class | 5.17% |
Small-Scale | CiteSeer | 3,327 | 4,552 | 3,703 | 6 | Multi-Class | 3.61% |
Small-Scale | PubMed | 19,717 | 44,324 | 500 | 3 | Multi-Class | 0.30% |
Small-Scale | Coauthor-CS | 18,333 | 81,894 | 6,805 | 15 | Multi-Class | 1.64% |
Small-Scale | Coauthor-Physics | 34,493 | 247,962 | 8,415 | 5 | Multi-Class | 0.29% |
Small-Scale | Amazon-Computer | 13,752 | 245,861 | 767 | 10 | Multi-Class | 1.45% |
Small-Scale | Amazon-Photo | 7,650 | 119,081 | 745 | 8 | Multi-Class | 2.09% |
Small-Scale | Wiki-CS | 11,701 | 215,863 | 300 | 10 | Multi-Class | 4.96% |
Large-Scale | CLUSTER | 1,406,436 | 25,810,340 | 6 | 6 | Multi-Class | 83.35% |
Large-Scale | Reddit | 232,965 | 11,606,919 | 602 | 41 | Multi-Class | 65.86% |
Large-Scale | PPI | 56,944 | 793,632 | 50 | 121 | Multi-Label | 78.86% |
Large-Scale | Flickr | 89,250 | 449,878 | 500 | 7 | Multi-Class | 50.00% |
Large-Scale | Yelp | 716,847 | 6,977,409 | 300 | 100 | Multi-Label | 75.00% |
Large-Scale | ogbn-arxiv | 169,343 | 1,157,799 | 128 | 40 | Multi-Class | 53.70% |
Large-Scale | ogbn-products | 2,449,029 | 61,859,076 | 100 | 47 | Multi-Class | 8.03% |
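The idea noted for GNNAutoScale, reusing node embeddings from earlier steps instead of expanding the full neighborhood, can be sketched roughly as follows. This is a forward-only toy in NumPy under several assumptions that are not from the paper: mean aggregation with a self-loop, two layers, fixed weights, and the hypothetical names `mean_aggregate` and `forward_with_history`.

```python
import numpy as np


def mean_aggregate(node, neighbors, emb):
    """Mean over the node's own row and its neighbors' rows of `emb`."""
    idx = [node] + list(neighbors[node])
    return emb[idx].mean(axis=0)


def forward_with_history(batch, x, neighbors, w1, w2, history):
    """Two-layer forward pass for the nodes in `batch` (weights held fixed)."""
    # Layer 1: raw features exist for every node, so nothing is approximated.
    for v in batch:
        h1 = np.maximum(mean_aggregate(v, neighbors, x) @ w1, 0.0)
        history[v] = h1  # push the fresh layer-1 embedding into the history
    # Layer 2: in-batch rows of `history` were just refreshed; out-of-batch
    # neighbors are read from the history as-is (possibly stale values).
    return {v: mean_aggregate(v, neighbors, history) @ w2 for v in batch}


rng = np.random.default_rng(0)
neighbors = [[1], [0, 2], [1, 3], [2]]  # toy path graph 0-1-2-3
x = rng.normal(size=(4, 3))
w1 = rng.normal(size=(3, 5))
w2 = rng.normal(size=(5, 2))

# Reference: one batch containing every node, so no stale reads happen.
full = forward_with_history([0, 1, 2, 3], x, neighbors, w1, w2, np.zeros((4, 5)))

# Mini-batches: the first pass only warms up the history.
history = np.zeros((4, 5))
for _ in range(2):
    mini = {}
    for batch in ([0, 1], [2, 3]):
        mini.update(forward_with_history(batch, x, neighbors, w1, w2, history))

print(all(np.allclose(full[v], mini[v]) for v in range(4)))
```

Each mini-batch touches only its own nodes plus their one-hop neighbors, so the per-step cost no longer grows with depth. With fixed weights the second pass reproduces the full-batch result. During real training the weights change between steps, so history entries are approximations rather than exact values.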
Scaling Graph Neural Networks with Approximate PageRank
Dataset | Nodes | Edges | Features | Labels | Label Type | Characteristics and Limitations |
Reddit | 233K | 11.6M | 602 | 41 | Single-Label (Multi-Class) | One of the largest graphs commonly used in prior work. |
Amazon2M | 2.5M | 61M | 100 | 47 | Single-Label (Multi-Class) | Large node count, but a small node feature dimension. |
Twitter Geo-Location | N/A | N/A | N/A | N/A | N/A | Unsuitable graph structure (most nodes have only a self-loop and no other edges). |
Cora-Full | 18.7K | 62.4K | 8.7K | N/A | Single-Label (Multi-Class) | A relatively small academic graph, frequently used as a benchmark dataset. |
PubMed | 19.7K | 44.3K | 0.5K | N/A | Single-Label (Multi-Class) | Frequently used as a benchmark dataset; small node feature dimension. |
MAG-Scholar-C | 10.5M | 133M | 2.8M | 8 | Coarse-Grained (Multi-Class) | Large-scale academic graph with papers labeled by 8 top-level fields of study. A large-scale benchmark dataset newly introduced on top of the MAG dataset. |
MAG-Scholar-F | 12.4M | 173M | 2.8M | 253 | Fine-Grained (Multi-Class) | Papers labeled by 253 fine-grained fields of study (architecture, geology, linguistics, etc.). |
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data
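• High-order feature interactions capture correlations between different columns in tabular data.
• Existing methods are either search-based or DNN-based, so they have a high search cost or lack interpretability.
• Defines a feature graph and uses a GNN.
• The datasets are listed below (Criteo has tens of millions of instances!).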
Dataset | #Features | #Train Instances | #Test Instances |
Employee | 9 | 29,493 | 3,278 |
Bank | 20 | 27,459 | 13,729 |
Adult | 42 | 32,561 | 16,281 |
Credit | 16 | 100,000 | 50,000 |
Criteo | 39 | 41.2M | 4.58M |
Business1 | 53 | 1.57M | 0.67M |
Business2 | 59 | 25.08M | 12.53M |
GFS: Graph-based Feature Synthesis for Prediction over Relational Databases
• Converts an RDB into a graph and combines it with strong single-table models (DeepFM, FT-Transformer, etc.) to improve predictive performance.
Dataset | Train Set | Validation Set | Test Set | Tables | Total Rows | Foreign Keys | Total Columns |
AVS (Acquire-valued-shoppers) | 96,035 | 32,011 | 32,011 | 3 | 8.2M | 2 | 23 |
Outbrain | 828,251 | 21,796 | 21,796 | 9 | 28M | 10 | 23 |
Diginetica | 882,415 | 20,521 | 20,521 | 6 | 2.7M | 7 | 20 |
KDD15 | 72,325 | 24,108 | 24,108 | 4 | 8.3M | 3 | 19 |
PinSage → 3 billion nodes, 18 billion edges
Supervised Learning on Relational Databases with Graph Neural Networks
Dataset | Train Datapoints | Tables/Node Types | Foreign Keys/Edge Types | Feature Types | URL |
Acquire Valued Shoppers Challenge | 160,057 | 7 | 10 | Categorical, Scalar, Datetime | Kaggle Link |
Home Credit Default Risk | 307,511 | 7 | 9 | Categorical, Scalar | Kaggle Link |
KDD Cup 2014 | 619,326 | 4 | 4 | Categorical, Geospatial, Scalar, Textual, Datetime | Kaggle Link |
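The table header above pairs tables with node types and foreign keys with edge types. A minimal sketch of that mapping, assuming each row becomes one node and each foreign-key reference becomes one edge, might look like the following. The toy tables, the `foreign_keys` list, and the function `rdb_to_graph` are all hypothetical and only illustrate the idea; they are not taken from the papers.

```python
from collections import defaultdict

# Hypothetical toy database: table name -> list of rows.
tables = {
    "customer": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    "order": [
        {"id": 10, "customer_id": 1, "amount": 5.0},
        {"id": 11, "customer_id": 1, "amount": 7.5},
        {"id": 12, "customer_id": 2, "amount": 1.0},
    ],
}
# Hypothetical foreign keys: (child table, child column, parent table).
foreign_keys = [("order", "customer_id", "customer")]


def rdb_to_graph(tables, foreign_keys, pk="id"):
    """Map rows to typed nodes and foreign-key references to typed edges."""
    # Node key is (table name, primary key); the table name is the node type.
    nodes = {}
    for table, rows in tables.items():
        for row in rows:
            nodes[(table, row[pk])] = row

    # Edge type is the foreign key itself: (child table, column, parent table).
    edges = defaultdict(list)
    for child, column, parent in foreign_keys:
        for row in tables[child]:
            src = (child, row[pk])
            dst = (parent, row[column])
            if dst in nodes:  # skip dangling references
                edges[(child, column, parent)].append((src, dst))
    return nodes, dict(edges)


nodes, edges = rdb_to_graph(tables, foreign_keys)
print(len(nodes), {etype: len(pairs) for etype, pairs in edges.items()})
```

With this mapping the graph has one node per row and one edge per foreign-key reference, which is why "Total Rows" is a rough proxy for graph size in the relational datasets listed in this note.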
TabGNN
• The JD dataset appears to be large.
Task | Dataset | #Samples (Train) | #Samples (Test) | #Features (#Num / #Cat) | Validation Ratio | Domain | Temporal Constraint |
Classification | Data1 | 35,581 | 8,895 | 16 / 17 | 15% | Loan | Yes |
Classification | Data2 | 1,888,366 | 1,119,778 | 8 / 23 | 5% | News | Yes |
Classification | Data3 | 108,801 | 27,201 | 19 / 9 | 15% | Loan | No |
Classification | Data4 | 226,091 | 34,867 | 14 / 26 | 10% | E-commerce | Yes |
Classification | Data5 | 435,329 | 31,076 | 8 / 34 | 10% | E-commerce | Yes |
Regression | Data6 | 1,638,193 | 702,016 | 43 / 16 | 10% | Live streaming | Yes |
Regression | Data7 | 3,923,406 | 694,194 | 0 / 25 | 5% | Retail | Yes |
Regression | Data8 | 10,512,133 | 29,879 | 4 / 17 | 5% | Retail | Yes |
Regression | Data9 | 179,893 | 43,236 | 5 / 2 | 15% | Government | Yes |
Classification | Home Credit | 307,511 | 48,744 | 175 / 51 | 10% | Loan | Yes |
Classification | JD | 4,992,910 | 446,763 | 6 / 17 | 5% | E-commerce | Yes |
RelBench
Prediction tasks and statistics of the RelBench datasets
Dataset | Task Name | Task Type | #Rows of Training Table (Train / Validation / Test) | #Unique Entities | %Train/Test Entity Overlap | #Dst Entities |
rel-amazon | user-churn | entity-cls | 4,732,555 / 409,792 / 351,885 | 1,585,983 | 88.0% | — |
rel-amazon | item-churn | entity-cls | 2,559,264 / 177,689 / 166,842 | 416,352 | 93.1% | — |
rel-amazon | user-ltv | entity-reg | 4,732,555 / 409,792 / 351,885 | 1,585,983 | 88.0% | — |
rel-amazon | item-ltv | entity-reg | 2,707,679 / 166,978 / 178,334 | 427,537 | 93.5% | — |
rel-amazon | user-item-purchase | recommendation | 5,112,803 / 351,876 / 393,985 | 1,632,909 | 87.4% | 12,562,384 |
rel-amazon | user-item-rate | recommendation | 3,667,157 / 257,939 / 292,609 | 1,481,360 | 81.0% | 7,665,611 |
rel-amazon | user-item-review | recommendation | 2,324,177 / 116,970 / 127,021 | 894,136 | 74.1% | 5,406,835 |
rel-avito | ad-ctr | entity-reg | 5,100 / 1,766 / 1,816 | 4,997 | 59.8% | — |
rel-avito | user-clicks | entity-cls | 59,454 / 21,183 / 47,996 | 66,449 | 45.3% | — |
rel-avito | user-visits | entity-cls | 86,619 / 29,979 / 36,129 | 63,405 | 64.6% | — |
rel-avito | user-ad-visit | recommendation | 86,616 / 29,979 / 36,129 | 63,402 | 64.6% | 3,616,174 |
rel-event | user-attendance | entity-reg | 19,261 / 2,014 / 2,006 | 9,694 | 14.6% | — |
rel-event | user-repeat | entity-cls | 3,842 / 268 / 246 | 1,514 | 11.5% | — |
rel-event | user-ignore | entity-cls | 19,239 / 4,185 / 4,010 | 9,799 | 21.1% | — |
rel-f1 | driver-dnf | entity-cls | 11,411 / 566 / 702 | 821 | 50.0% | — |
rel-f1 | driver-top3 | entity-cls | 1,353 / 588 / 726 | 134 | 50.0% | — |
rel-f1 | driver-position | entity-reg | 7,453 / 499 / 760 | 826 | 44.6% | — |
rel-hm | user-churn | entity-cls | 3,871,410 / 76,556 / 74,575 | 1,002,984 | 89.7% | — |
rel-hm | item-sales | entity-reg | 5,488,184 / 105,542 / 105,542 | 105,542 | 100.0% | — |
rel-hm | user-item-purchase | recommendation | 3,878,451 / 74,575 / 67,144 | 1,004,046 | 89.2% | 13,428,473 |
rel-stack | user-engagement | entity-cls | 1,360,850 / 85,838 / 88,137 | 88,137 | 97.4% | — |
rel-stack | user-badge | entity-cls | 3,386,276 / 247,398 / 255,360 | 255,360 | 96.9% | — |
rel-stack | post-votes | entity-reg | 2,453,921 / 156,216 / 160,903 | 160,903 | 97.1% | — |
rel-stack | user-post-comment | recommendation | 21,239 / 825 / 758 | 11,453 | 59.9% | 44,940 |
rel-stack | post-post-related | recommendation | 5,855 / 226 / 258 | 5,524 | 8.5% | 7,456 |
rel-trial | study-outcome | entity-cls | 11,994 / 960 / 825 | 17,379 | 0.0% | — |
rel-trial | study-adverse | entity-reg | 43,335 / 3,596 / 3,098 | 30,092 | 50.0% | — |
rel-trial | site-success | entity-reg | 151,407 / 19,740 / 22,617 | 129,542 | 42.0% | — |
rel-trial | condition-sponsor-run | recommendation | 36,934 / 3,081 / 2,057 | 3,956 | 98.4% | 533,624 |
rel-trial | site-sponsor-run | recommendation | 669,310 / 37,003 / 27,428 | 445,513 | 48.3% | 1,565,463 |
OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs
Task Type | Dataset | Statistics |
Node-level | MAG240M | #Nodes: 244,160,499; #Edges: 1,728,364,232 |
Link-level | WikiKG90M† | #Nodes: 87,143,637; #Edges: 504,220,369 |
Graph-level | PCQM4M† | #Graphs: 3,803,453; #Edges (Total): 55,399,880 |
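To put the edge counts collected in this note in perspective, the sketch below gives a back-of-envelope estimate of the memory needed just to hold an edge list. The storage layout is an assumption, not something stated in the papers: each edge is stored once as a pair of int64 node ids (COO format), with no features, labels, or reverse edges. The function name `edge_index_gib` is illustrative.

```python
BYTES_PER_INDEX = 8  # assumption: node ids stored as int64


def edge_index_gib(num_edges: int, bytes_per_index: int = BYTES_PER_INDEX) -> float:
    """Size in GiB of a COO edge list holding two node ids per edge."""
    return 2 * num_edges * bytes_per_index / 2**30


# Edge counts copied from the tables and notes above.
edge_counts = {
    "ogbn-products": 61_859_140,
    "MAG240M": 1_728_364_232,
    "PinSage (approx.)": 18_000_000_000,
}

for name, num_edges in edge_counts.items():
    print(f"{name:18s} {edge_index_gib(num_edges):8.2f} GiB")
```

Under these assumptions ogbn-products stays under 1 GiB, MAG240M needs tens of GiB for the edge list alone, and a PinSage-scale graph needs hundreds of GiB, before any node features are counted.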
Opinion
•
