datasets

GATNE dataset

class cogdl.datasets.gatne.AmazonDataset[source]

Bases: cogdl.datasets.gatne.GatneDataset

class cogdl.datasets.gatne.GatneDataset(root, name)[source]

Bases: cogdl.data.dataset.Dataset

The network datasets “Amazon”, “Twitter” and “YouTube” from the “Representation Learning for Attributed Multiplex Heterogeneous Network” paper.

Args:

    root (string): Root directory where the dataset should be saved.
    name (string): The name of the dataset ("Amazon", "Twitter", "YouTube").
download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://github.com/THUDM/GATNE/raw/master/data'
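
For example, one of the multiplex networks can be loaded by instantiating a subclass, or the base class with the documented root and name arguments. A minimal sketch; the storage path and the dataset[0] indexing are assumptions based on the base Dataset interface:

from cogdl.datasets.gatne import GatneDataset

# Downloads (if needed) and processes the "Amazon" multiplex network under ./data/amazon
dataset = GatneDataset(root="./data/amazon", name="Amazon")
data = dataset[0]  # the processed cogdl Data object holding the multiplex edges
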
class cogdl.datasets.gatne.TwitterDataset[source]

Bases: cogdl.datasets.gatne.GatneDataset

class cogdl.datasets.gatne.YouTubeDataset[source]

Bases: cogdl.datasets.gatne.GatneDataset

cogdl.datasets.gatne.read_gatne_data(folder)[source]

GCC dataset

class cogdl.datasets.gcc_data.Edgelist(root, name)[source]

Bases: cogdl.data.dataset.Dataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

num_classes

The number of classes in the dataset.

process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://github.com/cenyk1230/gcc-data/raw/master'
class cogdl.datasets.gcc_data.GCCDataset(root, name)[source]

Bases: cogdl.data.dataset.Dataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

preprocess(root, name)[source]
processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://github.com/cenyk1230/gcc-data/raw/master'
class cogdl.datasets.gcc_data.KDD_ICDM_GCCDataset[source]

Bases: cogdl.datasets.gcc_data.GCCDataset

class cogdl.datasets.gcc_data.SIGIR_CIKM_GCCDataset[source]

Bases: cogdl.datasets.gcc_data.GCCDataset

class cogdl.datasets.gcc_data.SIGMOD_ICDE_GCCDataset[source]

Bases: cogdl.datasets.gcc_data.GCCDataset

class cogdl.datasets.gcc_data.USAAirportDataset[source]

Bases: cogdl.datasets.gcc_data.Edgelist
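
As a usage sketch, the concrete subclasses can be instantiated directly; the no-argument constructor and default storage root are assumptions based on the signatures shown above:

from cogdl.datasets.gcc_data import USAAirportDataset

# Fetches the USA airport edge list from the gcc-data repository on first use
dataset = USAAirportDataset()
print(dataset.num_classes)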

GTN dataset

class cogdl.datasets.gtn_data.ACM_GTNDataset[source]

Bases: cogdl.datasets.gtn_data.GTNDataset

class cogdl.datasets.gtn_data.DBLP_GTNDataset[source]

Bases: cogdl.datasets.gtn_data.GTNDataset

class cogdl.datasets.gtn_data.GTNDataset(root, name)[source]

Bases: cogdl.data.dataset.Dataset

The network datasets “ACM”, “DBLP” and “IMDB” from the “Graph Transformer Networks” paper.

Args:

    root (string): Root directory where the dataset should be saved.
    name (string): The name of the dataset ("gtn-acm", "gtn-dblp", "gtn-imdb").
apply_to_device(device)[source]
download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

num_classes

The number of classes in the dataset.

process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

read_gtn_data(folder)[source]
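
A minimal usage sketch for the GTN heterogeneous datasets, following the documented constructor arguments; indexing with dataset[0] assumes the base Dataset interface:

from cogdl.datasets.gtn_data import GTNDataset

# Valid names are "gtn-acm", "gtn-dblp" and "gtn-imdb"
dataset = GTNDataset(root="./data/gtn-acm", name="gtn-acm")
data = dataset[0]
print(dataset.num_classes)
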
class cogdl.datasets.gtn_data.IMDB_GTNDataset[source]

Bases: cogdl.datasets.gtn_data.GTNDataset

HAN dataset

class cogdl.datasets.han_data.ACM_HANDataset[source]

Bases: cogdl.datasets.han_data.HANDataset

class cogdl.datasets.han_data.DBLP_HANDataset[source]

Bases: cogdl.datasets.han_data.HANDataset

class cogdl.datasets.han_data.HANDataset(root, name)[source]

Bases: cogdl.data.dataset.Dataset

The network datasets “ACM”, “DBLP” and “IMDB” from the “Heterogeneous Graph Attention Network” paper.

Args:

    root (string): Root directory where the dataset should be saved.
    name (string): The name of the dataset ("han-acm", "han-dblp", "han-imdb").
apply_to_device(device)[source]
download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

num_classes

The number of classes in the dataset.

process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

read_gtn_data(folder)[source]
class cogdl.datasets.han_data.IMDB_HANDataset[source]

Bases: cogdl.datasets.han_data.HANDataset

cogdl.datasets.han_data.sample_mask(idx, length)[source]

Create mask.
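
The docstring is terse; the following is a minimal sketch of what a mask helper of this shape conventionally returns (the actual cogdl implementation may use a different array or tensor type):

import numpy as np

def sample_mask(idx, length):
    # Boolean vector of size `length` with True at the positions listed in `idx`
    mask = np.zeros(length, dtype=bool)
    mask[idx] = True
    return mask

train_mask = sample_mask([0, 1, 2], length=10)  # marks the first three nodes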

KG dataset

class cogdl.datasets.kg_data.BidirectionalOneShotIterator(dataloader_head, dataloader_tail)[source]

Bases: object

static one_shot_iterator(dataloader)[source]

Transform a PyTorch DataLoader into a Python iterator.
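
A sketch of how such a wrapper is commonly written (not necessarily cogdl's exact code): it restarts the DataLoader whenever it is exhausted, so a training loop can call next() indefinitely without tracking epochs.

def one_shot_iterator(dataloader):
    # Yield batches forever by looping over the DataLoader repeatedly
    while True:
        for batch in dataloader:
            yield batch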

class cogdl.datasets.kg_data.FB13Datset[source]

Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

class cogdl.datasets.kg_data.FB13SDatset[source]

Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

url = 'https://raw.githubusercontent.com/cenyk1230/test-data/main'
class cogdl.datasets.kg_data.FB15k237Datset[source]

Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

class cogdl.datasets.kg_data.FB15kDatset[source]

Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

class cogdl.datasets.kg_data.KnowledgeGraphDataset(root, name)[source]

Bases: cogdl.data.dataset.Dataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

num_entities
num_relations
process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

test_start_idx
train_start_idx
url = 'https://raw.githubusercontent.com/thunlp/OpenKE/OpenKE-PyTorch/benchmarks'
valid_start_idx
class cogdl.datasets.kg_data.TestDataset(triples, all_true_triples, nentity, nrelation, mode)[source]

Bases: torch.utils.data.dataset.Dataset

static collate_fn(data)[source]
class cogdl.datasets.kg_data.TrainDataset(triples, nentity, nrelation, negative_sample_size, mode)[source]

Bases: torch.utils.data.dataset.Dataset

static collate_fn(data)[source]
static count_frequency(triples, start=4)[source]

Get the frequency of a partial triple like (head, relation) or (relation, tail). The frequency will be used for subsampling, as in word2vec.
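
A sketch of the counting logic; the exact key encoding used by cogdl may differ, and start acts as a smoothing constant:

from collections import defaultdict

def count_frequency(triples, start=4):
    # Count occurrences of the partial triples (head, relation) and (relation, tail),
    # initialised at `start`; the counts drive word2vec-style subsampling weights.
    count = defaultdict(lambda: start)
    for head, relation, tail in triples:
        count[(head, relation)] += 1
        count[(relation, tail)] += 1
    return dict(count)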

static get_true_head_and_tail(triples)[source]

Build a dictionary of true triples, which is used to filter out true triples during negative sampling.

class cogdl.datasets.kg_data.WN18Datset[source]

Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

class cogdl.datasets.kg_data.WN18RRDataset[source]

Bases: cogdl.datasets.kg_data.KnowledgeGraphDataset

cogdl.datasets.kg_data.read_triplet_data(folder)[source]

Matlab matrix dataset

class cogdl.datasets.matlab_matrix.BlogcatalogDataset[source]

Bases: cogdl.datasets.matlab_matrix.MatlabMatrix

class cogdl.datasets.matlab_matrix.DblpNEDataset[source]

Bases: cogdl.datasets.matlab_matrix.NetworkEmbeddingCMTYDataset

class cogdl.datasets.matlab_matrix.FlickrDataset[source]

Bases: cogdl.datasets.matlab_matrix.MatlabMatrix

class cogdl.datasets.matlab_matrix.MatlabMatrix(root, name, url)[source]

Bases: cogdl.data.dataset.Dataset

Network datasets from http://leitang.net/code/social-dimension/data/ or http://snap.stanford.edu/node2vec/.

Args:

    root (string): Root directory where the dataset should be saved.
    name (string): The name of the dataset ("Blogcatalog").
download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

num_classes

The number of classes in the dataset.

num_nodes
process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.
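
A minimal usage sketch for one of the .mat-based datasets; the no-argument constructor and default storage path are assumptions based on the signatures shown above:

from cogdl.datasets.matlab_matrix import BlogcatalogDataset

# Fetches blogcatalog.mat on first use and exposes the graph plus multi-label targets
dataset = BlogcatalogDataset()
print(dataset.num_nodes, dataset.num_classes)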

class cogdl.datasets.matlab_matrix.NetworkEmbeddingCMTYDataset(root, name, url)[source]

Bases: cogdl.data.dataset.Dataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

num_classes

The number of classes in the dataset.

num_nodes
process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.matlab_matrix.PPIDataset[source]

Bases: cogdl.datasets.matlab_matrix.MatlabMatrix

class cogdl.datasets.matlab_matrix.WikipediaDataset[source]

Bases: cogdl.datasets.matlab_matrix.MatlabMatrix

class cogdl.datasets.matlab_matrix.YoutubeNEDataset[source]

Bases: cogdl.datasets.matlab_matrix.NetworkEmbeddingCMTYDataset

PyG OGB dataset

class cogdl.datasets.ogb.OGBArxivDataset[source]

Bases: cogdl.datasets.ogb.OGBNDataset

class cogdl.datasets.ogb.OGBCodeDataset[source]

Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBGDataset(root, name)[source]

Bases: cogdl.data.dataset.Dataset

get(idx)[source]

Gets the data object at index idx.

get_loader(args)[source]
get_subset(subset)[source]
num_classes

The number of classes in the dataset.

class cogdl.datasets.ogb.OGBMAGDataset[source]

Bases: cogdl.datasets.ogb.OGBNDataset

class cogdl.datasets.ogb.OGBMolbaceDataset[source]

Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBMolhivDataset[source]

Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBMolpcbaDataset[source]

Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBNDataset(root, name)[source]

Bases: cogdl.data.dataset.Dataset

get(idx)[source]

Gets the data object at index idx.

get_evaluator()[source]
get_loss_fn()[source]
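
A usage sketch for the node-level OGB wrappers; the no-argument constructor is an assumption based on the subclass signatures in this section, and the underlying download is handled by the ogb package:

from cogdl.datasets.ogb import OGBArxivDataset

dataset = OGBArxivDataset()
data = dataset[0]                    # graph with the train/valid/test splits provided by OGB
evaluator = dataset.get_evaluator()
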
class cogdl.datasets.ogb.OGBPapers100MDataset[source]

Bases: cogdl.datasets.ogb.OGBNDataset

class cogdl.datasets.ogb.OGBPpaDataset[source]

Bases: cogdl.datasets.ogb.OGBGDataset

class cogdl.datasets.ogb.OGBProductsDataset[source]

Bases: cogdl.datasets.ogb.OGBNDataset

class cogdl.datasets.ogb.OGBProteinsDataset[source]

Bases: cogdl.datasets.ogb.OGBNDataset

cogdl.datasets.ogb.coalesce(row, col, edge_attr=None)[source]

PyG strategies dataset

This file is borrowed from https://github.com/snap-stanford/pretrain-gnns/

class cogdl.datasets.strategies_data.BACEDataset(transform=None, pre_transform=None, pre_filter=None, empty=False)[source]

Bases: cogdl.data.dataset.MultiGraphDataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.strategies_data.BBBPDataset(transform=None, pre_transform=None, pre_filter=None, empty=False)[source]

Bases: cogdl.data.dataset.MultiGraphDataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.strategies_data.BatchAE(batch=None, **kwargs)[source]

Bases: cogdl.data.data.Data

cat_dim(key)[source]

Returns the dimension in which the attribute key with content value gets concatenated when creating batches.

Note

This method is for internal use only, and should only be overridden if the batch concatenation process is corrupted for a specific data attribute.

static from_data_list(data_list)[source]

Constructs a batch object from a python list holding torch_geometric.data.Data objects. The assignment vector batch is created on the fly.

num_graphs

Returns the number of graphs in the batch.

class cogdl.datasets.strategies_data.BatchFinetune(batch=None, **kwargs)[source]

Bases: cogdl.data.data.Data

static from_data_list(data_list)[source]

Constructs a batch object from a python list holding torch_geometric.data.Data objects. The assignment vector batch is created on the fly.

num_graphs

Returns the number of graphs in the batch.

class cogdl.datasets.strategies_data.BatchMasking(batch=None, **kwargs)[source]

Bases: cogdl.data.data.Data

cumsum(key, item)[source]

If True, the attribute key with content item should be added up cumulatively before being concatenated together.

Note

This method is for internal use only, and should only be overridden if the batch concatenation process is corrupted for a specific data attribute.

static from_data_list(data_list)[source]

Constructs a batch object from a python list holding torch_geometric.data.Data objects. The assignment vector batch is created on the fly.

num_graphs

Returns the number of graphs in the batch.

class cogdl.datasets.strategies_data.BatchSubstructContext(batch=None, **kwargs)[source]

Bases: cogdl.data.data.Data

cat_dim(key)[source]

Returns the dimension in which the attribute key with content value gets concatenated when creating batches.

Note

This method is for internal use only, and should only be overridden if the batch concatenation process is corrupted for a specific data attribute.

cumsum(key, item)[source]

If True, the attribute key with content item should be added up cumulatively before being concatenated together.

Note

This method is for internal use only, and should only be overridden if the batch concatenation process is corrupted for a specific data attribute.

static from_data_list(data_list)[source]

Constructs a batch object from a python list holding torch_geometric.data.Data objects. The assignment vector batch is created on the fly.

num_graphs

Returns the number of graphs in the batch.

class cogdl.datasets.strategies_data.BioDataset(data_type='unsupervised', empty=False, transform=None, pre_transform=None, pre_filter=None)[source]

Bases: cogdl.data.dataset.MultiGraphDataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.strategies_data.ChemExtractSubstructureContextPair(k, l1, l2)[source]

Bases: object

class cogdl.datasets.strategies_data.DataLoaderAE(dataset, batch_size=1, shuffle=True, **kwargs)[source]

Bases: torch.utils.data.dataloader.DataLoader

class cogdl.datasets.strategies_data.DataLoaderFinetune(dataset, batch_size=1, shuffle=True, **kwargs)[source]

Bases: torch.utils.data.dataloader.DataLoader

class cogdl.datasets.strategies_data.DataLoaderMasking(dataset, batch_size=1, shuffle=True, **kwargs)[source]

Bases: torch.utils.data.dataloader.DataLoader

class cogdl.datasets.strategies_data.DataLoaderSubstructContext(dataset, batch_size=1, shuffle=True, **kwargs)[source]

Bases: torch.utils.data.dataloader.DataLoader

class cogdl.datasets.strategies_data.ExtractSubstructureContextPair(l1, center=True)[source]

Bases: object

class cogdl.datasets.strategies_data.MaskAtom(num_atom_type, num_edge_type, mask_rate, mask_edge=True)[source]

Bases: object

Borrowed from https://github.com/snap-stanford/pretrain-gnns/

class cogdl.datasets.strategies_data.MaskEdge(mask_rate)[source]

Bases: object

Borrowed from https://github.com/snap-stanford/pretrain-gnns/

class cogdl.datasets.strategies_data.MoleculeDataset(data_type='unsupervised', transform=None, pre_transform=None, pre_filter=None, empty=False)[source]

Bases: cogdl.data.dataset.MultiGraphDataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

class cogdl.datasets.strategies_data.NegativeEdge[source]

Bases: object

Borrowed from https://github.com/snap-stanford/pretrain-gnns/

class cogdl.datasets.strategies_data.TestBioDataset(data_type='unsupervised', root='testbio', transform=None, pre_transform=None, pre_filter=None)[source]

Bases: cogdl.data.dataset.MultiGraphDataset

class cogdl.datasets.strategies_data.TestChemDataset(data_type='unsupervised', root='testchem', transform=None, pre_transform=None, pre_filter=None)[source]

Bases: cogdl.data.dataset.MultiGraphDataset

cogdl.datasets.strategies_data.graph_data_obj_to_nx(data)[source]
cogdl.datasets.strategies_data.graph_data_obj_to_nx_simple(data)[source]

Converts a graph Data object, as required by the pytorch geometric package, into a networkx object. NB: uses simplified atom and bond features, represented as indices. NB: possible issues with recapitulating relative stereochemistry, since the edges in the networkx object are unordered.

Parameters: data - pytorch geometric Data object
Returns: networkx graph

cogdl.datasets.strategies_data.nx_to_graph_data_obj(g, center_id, allowable_features_downstream=None, allowable_features_pretrain=None, node_id_to_go_labels=None)[source]
cogdl.datasets.strategies_data.nx_to_graph_data_obj_simple(G)[source]

Converts a networkx graph into a pytorch geometric Data object. Assumes node indices are numbered from 0 to num_nodes - 1. NB: uses simplified atom and bond features, represented as indices. NB: possible issues with recapitulating relative stereochemistry, since the edges in the networkx object are unordered.

Parameters: G - networkx graph
Returns: pytorch geometric Data object
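
A sketch of the core of such a conversion, assuming nodes are already numbered from 0 to num_nodes - 1 and omitting the atom/bond feature extraction that the real function performs; the helper name is hypothetical:

import torch
import networkx as nx

def nx_edge_index(G: nx.Graph) -> torch.Tensor:
    # Build a (2, 2 * num_edges) edge_index containing both directions of every undirected edge
    edges = []
    for u, v in G.edges():
        edges.append((u, v))
        edges.append((v, u))
    if not edges:
        return torch.empty((2, 0), dtype=torch.long)
    return torch.tensor(edges, dtype=torch.long).t().contiguous()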

cogdl.datasets.strategies_data.reset_idxes(G)[source]

Resets node indices so that they are numbered from 0 to num_nodes - 1.

Parameters: G - networkx graph
Returns: copy of G with relabelled node indices, together with the old-to-new index mapping
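
A minimal sketch of this relabelling, assuming networkx's relabel_nodes is an acceptable way to produce the copy:

import networkx as nx

def reset_idxes(G):
    # Relabel nodes to 0 .. num_nodes - 1 and return the relabelled copy plus the mapping
    mapping = {old: new for new, old in enumerate(G.nodes())}
    return nx.relabel_nodes(G, mapping, copy=True), mapping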

TU dataset

class cogdl.datasets.tu_data.CollabDataset[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.ENZYMES[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.ImdbBinaryDataset[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.ImdbMultiDataset[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.MUTAGDataset[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.NCT109Dataset[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.NCT1Dataset[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.PTCMRDataset[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.ProtainsDataset[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.RedditBinary[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.RedditMulti12K[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.RedditMulti5K[source]

Bases: cogdl.datasets.tu_data.TUDataset

class cogdl.datasets.tu_data.TUDataset(root, name)[source]

Bases: cogdl.data.dataset.Dataset

download()[source]

Downloads the dataset to the self.raw_dir folder.

get(idx)[source]

Gets the data object at index idx.

num_classes

The number of classes in the dataset.

num_edge_attributes
num_edge_labels
num_node_attributes
num_node_labels
process()[source]

Processes the dataset to the self.processed_dir folder.

processed_file_names

The name of the files to find in the self.processed_dir folder in order to skip the processing.

raw_file_names

The name of the files to find in the self.raw_dir folder in order to skip the download.

url = 'https://www.chrsmrrs.com/graphkerneldatasets'
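
A minimal usage sketch for the graph-classification benchmarks; the no-argument constructor and default root are assumptions based on the subclass signatures above:

from cogdl.datasets.tu_data import MUTAGDataset

dataset = MUTAGDataset()              # downloads MUTAG from the graph kernel datasets site on first use
print(len(dataset), dataset.num_classes)
graph = dataset[0]                    # a single graph as a cogdl Data object
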
cogdl.datasets.tu_data.cat(seq)[source]
cogdl.datasets.tu_data.coalesce(index, value, m, n)[source]
cogdl.datasets.tu_data.normalize_feature(data)[source]
cogdl.datasets.tu_data.parse_txt_array(src, sep=None, start=0, end=None, dtype=None, device=None)[source]
cogdl.datasets.tu_data.read_file(folder, prefix, name, dtype=None)[source]
cogdl.datasets.tu_data.read_tu_data(folder, prefix)[source]
cogdl.datasets.tu_data.read_txt_array(path, sep=None, start=0, end=None, dtype=None, device=None)[source]
cogdl.datasets.tu_data.segment(src, indptr)[source]
cogdl.datasets.tu_data.split(data, batch)[source]

Module contents

cogdl.datasets.build_dataset(args)[source]
cogdl.datasets.build_dataset_from_name(dataset)[source]
cogdl.datasets.build_dataset_from_path(data_path, task)[source]
cogdl.datasets.register_dataset(name)[source]

New dataset types can be added to cogdl with the register_dataset() function decorator.

For example:

@register_dataset('my_dataset')
class MyDataset():
    (...)

Args:

    name (str): the name of the dataset
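
A slightly fuller sketch of the registration pattern; the class body, file layout and default root are hypothetical, and only the decorator usage is taken from the docstring above:

from cogdl.data.dataset import Dataset
from cogdl.datasets import register_dataset

@register_dataset("my_dataset")
class MyDataset(Dataset):
    # Hypothetical custom dataset: implement raw_file_names, processed_file_names,
    # download() and process() as in the built-in datasets documented above.
    def __init__(self, root="data/my_dataset"):
        super().__init__(root)

# Once registered, the dataset can be built by name, e.g. with build_dataset_from_name("my_dataset").
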
cogdl.datasets.try_import_dataset(dataset)[source]