PartMining

Repository for all the code and resources associated with the paper Language-Model Based Informed Partition of Databases to Speed Up Pattern Mining, submitted to ACM SIGMOD/PODS Conference, 2024.

Content of the repository

Source code of the approach for partitioning a transactional database adopting static dense language models and its related experiments. The code to convert RDF data into a transactional database can be found in the SWKrimpSim repository associated with our previous works on assessing the structural differences of RDF graphs.

The sid folder contains the source code for managing all the data (loading the different formats of databases and code tables, calculating the item and transaction embeddings, partitioning the database, and calculating the different measures - e.g., compression ratio, entropy, etc.).
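As a rough illustration of the steps listed above, the following sketch shows one way they could fit together. It is not the code in the sid folder. It assumes a transaction file with one transaction per line and whitespace-separated integer item identifiers, takes a transaction embedding to be the mean of its item vectors, assigns each transaction to its nearest centroid, and uses the Shannon entropy of item occurrences as the entropy measure. All function names are hypothetical.

```python
import math
from collections import Counter


def parse_transactions(text):
    """Parse one transaction per line, items as whitespace-separated
    integers (an assumed layout, not confirmed by the repository)."""
    return [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


def item_entropy(transactions):
    """Shannon entropy, in bits, of the item occurrence distribution."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def transaction_embedding(transaction, item_vectors):
    """Mean of the item vectors of a transaction (one assumed aggregation).
    Items without a vector are skipped; an empty result is the zero vector."""
    dim = len(next(iter(item_vectors.values())))
    vecs = [item_vectors[i] for i in transaction if i in item_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


def assign_partition(embedding, centroids):
    """Index of the nearest centroid by squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(embedding, centroids[k])),
    )
```

In this sketch the item vectors and centroids are plain Python lists passed in by the caller; in practice they would come from a trained embedding model and a clustering step.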

We use a modified version of SLIM (extended to handle item identifiers wider than 16 bits) to mine the interesting patterns of each partition; it can be found here. We suggest using the scripts already available in SWKrimpSim, as they greatly simplify using both the KRIMP and SLIM implementations.
The scripts folder contains auxiliary scripts to launch the different steps of the approach (as we use SLIM externally to mine the patterns of each partition, we have split the steps into separate scripts).
The datasets folder contains the .dat files of the synthetic and non-synthetic datasets used in the experiments (corresponding to SBDs and CBDs). The LDBs datasets can be found here (GitHub did not allow us to host files larger than 100 MB): DBpedia36, DBpedia2014, DBpedia2016-10, and Kosarak. Due to some server configuration problems, the https layer is not able to serve the files, which is why we have left them with just the http URL. The synthetic ones (SDBs) were built following Guidotti et al., Clustering Individual Transactional Data for Masses of Users (SIGKDD'17), using their code available at the tx-means repository. Since SLIM and TX-Means use different file formats to represent transactions, we provide the SBDs and CBDs data files in both formats (files in TX-Means format are marked with _tx in the filename).
The experimentalResults folder contains the raw results of the different experiments reported in the paper.
The notebooks folder contains the initial notebooks used in the early stages of the project, including visualization and the correlation between vector length and item frequency. However, the code to be used is included in the sid folder (see above).
For better reproducibility of the classification and clustering experiments, we also provide pre-trained transaction embeddings for CBDs and SBDs: embeddings for CBDs and embeddings for SBDs. To reproduce the experiments:
- Classification experiments: Follow the instructions in the notebook notebooks/ExperimentsClassificationTransactions.ipynb
- Clustering experiments: Follow the instructions in the Python scripts sid/ClusteringExperiments/clustersExperimentsEmbeddings.py, sid/ClusteringExperiments/clustersExperimentsTkmeans.py, sid/ClusteringExperiments/clustersExperimentsTxmeans.py, sid/ClusteringExperiments/clustersExperimentsPairedTX-Emb.py
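Because SLIM is run externally on each partition, the partitioning step has to leave one transaction file per partition on disk before the mining step starts. The sketch below shows one way to do that hand-off. It is not the repository's script. The function name, the `partition_<label>.dat` naming scheme, and the one-transaction-per-line, space-separated layout are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import Path


def write_partitions(transactions, labels, out_dir, stem="partition"):
    """Write one file per partition label and return {label: Path}.

    Each output line holds one transaction as space-separated item ids
    (an assumed layout; check it against the format SLIM expects).
    """
    if len(transactions) != len(labels):
        raise ValueError("one label per transaction is required")

    # Group transactions by their partition label, keeping input order.
    groups = defaultdict(list)
    for transaction, label in zip(transactions, labels):
        groups[label].append(transaction)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = {}
    for label, rows in sorted(groups.items()):
        path = out / f"{stem}_{label}.dat"
        with path.open("w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(" ".join(str(item) for item in row) + "\n")
        paths[label] = path
    return paths
```

Each returned path can then be passed to whichever external mining script is used for that partition.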

Citation

Not yet published; submitted to the ACM SIGMOD/PODS Conference, 2024.
