Repository for all the code and resources associated with the paper "Language-Model Based Informed Partition of Databases to Speed Up Pattern Mining", submitted to the ACM SIGMOD/PODS Conference, 2024.
It contains the source code of our approach for partitioning a transactional database using static dense language models, together with the related experiments. The code for converting RDF data into a transactional database is in the SWKrimpSim repository, which accompanies our previous work on assessing the structural differences of RDF graphs.
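
As a rough illustration of the idea (this is not the code in the sid folder), the sketch below partitions a toy transactional database by clustering transaction embeddings. It assumes that a transaction embedding is the mean of its item vectors and that the partition comes from a plain k-means; both are assumptions made for the example. The item vectors and the names `transaction_embedding`, `kmeans_partition`, and `partition_database` are hypothetical.

```python
from collections import defaultdict
from math import dist


def transaction_embedding(transaction, item_vectors):
    """Average the vectors of the items in a non-empty transaction (assumed pooling)."""
    vectors = [item_vectors[item] for item in transaction]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def kmeans_partition(points, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]  # deterministic toy initialisation
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def partition_database(transactions, item_vectors, k):
    """Group transactions by the cluster of their embedding."""
    embeddings = [transaction_embedding(t, item_vectors) for t in transactions]
    labels = kmeans_partition(embeddings, k)
    partitions = defaultdict(list)
    for transaction, label in zip(transactions, labels):
        partitions[label].append(transaction)
    return dict(partitions)


# Hypothetical 2-d item vectors: items 1-3 lie close together, as do items 7-9.
ITEM_VECTORS = {1: (0.0, 0.1), 2: (0.1, 0.0), 3: (0.1, 0.1),
                7: (5.0, 5.1), 8: (5.1, 5.0), 9: (5.1, 5.1)}
TRANSACTIONS = [[1, 2], [7, 8], [2, 3], [8, 9], [1, 3], [7, 9]]

PARTITIONS = partition_database(TRANSACTIONS, ITEM_VECTORS, k=2)
```

Here `PARTITIONS` maps each cluster label to the transactions assigned to it; each partition could then be written to its own file and mined separately.
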
- The sid folder contains the source code for managing all the data: loading the different database and code table formats, calculating the item and transaction embeddings, partitioning the database, and calculating the different measures (e.g., compression ratio and entropy).
  To mine the interesting patterns of each partition, we use a modified version of SLIM (extended to handle item identifiers wider than 16 bits), which can be found here. We suggest using the scripts already available at SWKrimpSim, as they greatly simplify using both the KRIMP and SLIM implementations.
- The scripts folder contains auxiliary scripts to launch the different steps of the approach. Because we run SLIM externally to mine the patterns of each partition, the steps are split across separate scripts.
- The datasets folder contains the .dat files of the synthetic and non-synthetic datasets used in the experiments (corresponding to SBDs and CBDs). The LDBs can be found here (GitHub did not allow us to host files larger than 100 MB): DBpedia36, DBpedia2014, DBpedia2016-10, and Kosarak. Because of a server configuration problem, the files cannot be served over HTTPS, so these links use plain HTTP. The synthetic ones (SDBs) were built following Guidotti et al., Clustering Individual Transactional Data for Masses of Users (SIGKDD'17), using their code available at the tx-means repository. Since SLIM and TX-Means use different file formats to represent transactions, we provide the SBDs and CBDs data files in both formats (the TX-Means format is marked with `_tx` in the filename).
- The experimentalResults folder contains the raw results of the different experiments reported in the paper.
- The notebooks folder contains the initial notebooks used in the early stages of the project, including visualization and the correlation between vector length and item frequency. However, the code to be used is the one in the sid folder (see above).
- For better reproducibility of the classification and clustering experiments, we also provide pre-trained transaction embeddings for CBDs and SBDs: embeddings for CBDs and embeddings for SBDs. To reproduce the experiments:
  - Classification experiments: follow the instructions in the notebook `notebooks/ExperimentsClassificationTransactions.ipynb`.
  - Clustering experiments: follow the instructions in the Python scripts `sid/ClusteringExperiments/clustersExperimentsEmbeddings.py`, `sid/ClusteringExperiments/clustersExperimentsTkmeans.py`, `sid/ClusteringExperiments/clustersExperimentsTxmeans.py`, and `sid/ClusteringExperiments/clustersExperimentsPairedTX-Emb.py`.
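
To give an idea of the kind of measure computed in the sid folder, here is a minimal sketch of one possible entropy calculation over a transactional database. It assumes that a .dat file holds one transaction per line, with items as whitespace-separated integer identifiers, and that the entropy of interest is the Shannon entropy of the item occurrence distribution. Both are assumptions made for this example; the actual definitions are the ones in the sid code. `parse_dat` and `item_entropy` are hypothetical helper names.

```python
from collections import Counter
from math import log2


def parse_dat(text):
    """Parse transactions: one per line, items as whitespace-separated integers."""
    return [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


def item_entropy(transactions):
    """Shannon entropy (in bits) of the item occurrence distribution."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Toy database in the assumed format: four items, each occurring twice.
SAMPLE = "1 2\n3 4\n1 2\n3 4\n"
DB = parse_dat(SAMPLE)
ENTROPY = item_entropy(DB)  # uniform over 4 items, so 2.0 bits
```

For a real dataset, the text would come from reading one of the .dat files instead of the inline `SAMPLE` string.
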
Not yet published; submitted to the ACM SIGMOD/PODS Conference, 2024.