Documentation of the main classes
1. Setup Data Folder
Module to set up a data directory with a predefined structure.
This module provides the SetupDataFolder class, which creates the directory structure for a data folder. The structure consists of nodes and relationships folders, each with specified subfolders.
Classes:
Name | Description |
---|---|
SetupDataFolder | Class to set up a data directory with a predefined structure. |
Functions:
Name | Description |
---|---|
main | Main function to set up the data directory. |
SetupDataFolder
Class to set up a data directory with a predefined structure.
Attributes:
Name | Type | Description |
---|---|---|
data_folder | str | The name of the data folder. |
base_path | str | The base path for the data directory. |
structure | dict | The structure of directories to create. |
Source code in chemgraphbuilder/setup_data_folder.py
__init__()
Initializes SetupDataFolder with the data folder name and directory structure.
Source code in chemgraphbuilder/setup_data_folder.py
create_folder(path)
staticmethod
Creates a folder if it does not already exist.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path of the folder to create. | required |
Source code in chemgraphbuilder/setup_data_folder.py
setup()
Sets up the data directory structure based on the predefined structure.
Source code in chemgraphbuilder/setup_data_folder.py
main()
Main function to set up the data directory.
Source code in chemgraphbuilder/setup_data_folder.py
2. Neo4j Driver
Module for managing connections to a Neo4j database.
This module provides classes and methods to establish and manage connections with a Neo4j database, including custom error handling.
Neo4jBase
Base class to manage connections with the Neo4j database.
Attributes:
- uri: The connection URI for the Neo4j database.
- user: The username to use for authentication.
- driver: The driver object used to interact with the Neo4j database.
Methods:
- connect_to_neo4j: Establish a connection to the Neo4j database.
- close: Close the connection to the Neo4j database.
Source code in chemgraphbuilder/neo4jdriver.py
close()
Close the connection to the Neo4j database.
Source code in chemgraphbuilder/neo4jdriver.py
connect_to_neo4j()
Establish a connection to the Neo4j database using the provided URI and username.
Source code in chemgraphbuilder/neo4jdriver.py
Neo4jConnectionError
Bases: Exception
Custom exception for Neo4j connection errors.
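The connection handling and custom exception described in this section could be sketched roughly as follows. This is not the module's code: `ConnectionSketch`, the `driver_factory` argument, and passing the password to `connect_to_neo4j` are assumptions chosen so the example runs without a database. With the official Python driver, the factory would be `neo4j.GraphDatabase.driver`.

```python
class Neo4jConnectionError(Exception):
    """Custom exception for Neo4j connection errors."""


class ConnectionSketch:
    """Hypothetical stand-in for a base class that owns a driver."""

    def __init__(self, uri, user, driver_factory):
        self.uri = uri
        self.user = user
        self.driver = None
        # Injected so the sketch runs offline; with the official driver this
        # would be neo4j.GraphDatabase.driver (assumption for illustration).
        self._driver_factory = driver_factory

    def connect_to_neo4j(self, password):
        """Create the driver, wrapping any failure in Neo4jConnectionError."""
        try:
            self.driver = self._driver_factory(self.uri, auth=(self.user, password))
        except Exception as exc:
            raise Neo4jConnectionError(f"Could not connect to {self.uri}") from exc
        return self.driver

    def close(self):
        """Close the driver if one is open; safe to call more than once."""
        if self.driver is not None:
            self.driver.close()
            self.driver = None
```

Wrapping the low-level failure in one exception type lets callers catch connection problems without depending on the driver's own exception hierarchy.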
Source code in chemgraphbuilder/neo4jdriver.py
3. Node Properties Extractor
This module defines the NodePropertiesExtractor class, responsible for extracting data from the PubChem database to build knowledge graphs in Neo4j. The class focuses on nodes representing chemical entities and their relationships, allowing users to query chemical data and construct a graph-based representation of chemical compounds, their assays, related genes, and proteins.
The primary functionality revolves around fetching detailed information about specified enzymes from PubChem, including assay data, gene properties, protein properties, and compound properties. It processes this data into a structured format suitable for knowledge graph construction, specifically tailored for use with Neo4j databases.
Classes:
Name | Description |
---|---|
NodePropertiesExtractor | A class to extract data from PubChem to build knowledge graphs in Neo4j. |
Usage Example
    enzyme_list = ['CYP2D6', 'CYP3A4']
    extractor = NodePropertiesExtractor(enzyme_list)
    df = extractor.run()

This example initiates the extractor with a list of enzymes, fetches their data from PubChem, processes it, and potentially prepares it for knowledge graph construction in Neo4j.
Note
To fully utilize this class, ensure you have network access to the PubChem API for data retrieval and a Neo4j database instance for knowledge graph construction. The class methods facilitate data extraction and processing, but integrating the output into Neo4j requires additional steps outside the scope of this class.
NodePropertiesExtractor
Extracts data from PubChem to build knowledge graphs in Neo4j, focusing on nodes representing chemical entities and their relationships. This class serves as a bridge between the PubChem database and Neo4j, allowing users to query chemical data and construct a graph-based representation of chemical compounds, their assays, related genes, and proteins.
Attributes:
Name | Type | Description |
---|---|---|
enzyme_list | list of str | Enzymes to query in the PubChem database. |
_base_url | str | Base URL for the PubChem API requests. |
_sep | str | Delimiter for parsing CSV data from PubChem. |
_enzyme_count | int | Number of enzymes in the enzyme_list, calculated at |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
enzyme_list | list of str | List of enzyme names for which assay data | required |
base_url | str | Base URL for PubChem API requests. Defaults to | 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol' |
sep | str | Separator used for parsing CSV data returned | ',' |
Source code in chemgraphbuilder/node_properties_extractor.py
__init__(enzyme_list, base_url='https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol', sep=',')
Initializes a NodePropertiesExtractor instance, setting up the base URL for API requests, the separator for CSV parsing, and the list of enzymes to query from the PubChem database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
enzyme_list | list of str | A list of enzyme names for which to fetch | required |
base_url | str | The base URL for PubChem API requests. | 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol' |
sep | str | The delimiter to use for parsing CSV files | ',' |
Attributes:
Name | Type | Description |
---|---|---|
_base_url | str | Stores the base URL for API requests. |
_sep | str | Stores the delimiter for parsing CSV data. |
enzyme_list | list of str | Stores the list of enzyme names provided |
_enzyme_count | int | The number of enzymes in the enzyme_list. |
Source code in chemgraphbuilder/node_properties_extractor.py
extract_assay_properties(main_data)
Extracts detailed properties of assays from PubChem for each unique assay ID found in the input data file.
This method processes an input CSV file containing assay IDs (AID) and performs concurrent HTTP requests to fetch detailed assay properties from the PubChem database. The retrieved details include assay type, activity name, source name, source ID, name, and description. These properties are compiled into a new DataFrame, which is then saved to a CSV file for further analysis or use.
The method employs a ThreadPoolExecutor to manage concurrent requests efficiently, improving the performance when dealing with a large number of assay IDs. Errors encountered during data fetching are logged, and the process continues with the next assay ID, ensuring the method's robustness.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the CSV file containing the main data (for example, 'Data/AllDataConnected.csv' as written by run()). | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | A DataFrame containing the fetched assay properties, including columns for AID, Assay Type, Activity Name, SourceName, SourceID, Name, and Description. This DataFrame is saved to 'Data/Nodes/Assay_Properties.csv' in the current working directory. |
Raises:
Type | Description |
---|---|
ValueError | If the input CSV file is empty or does not contain the 'AID' column. |
Example
    extractor = NodePropertiesExtractor(['CYP2D6', 'CYP3A4'])
    extractor.create_data_directories()
    extractor.run()
    assay_properties_df = extractor.extract_assay_properties('Data/AllDataConnected.csv')
    print(assay_properties_df.head())
This example reads assay IDs from 'Data/AllDataConnected.csv', queries PubChem for their detailed properties, and compiles the results into a DataFrame, which is also saved to 'Data/Nodes/Assay_Properties.csv'.
Note
This method requires network access to the PubChem API and assumes the availability of a valid 'AID' column in the input CSV file. Ensure the input file path is correct and accessible to avoid errors during processing.
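The "fetch concurrently, log failures, keep going" pattern described for this method can be sketched with the standard library alone. The helper below is illustrative rather than the package's implementation: `fetch_all`, `fetch_one`, and the row layout are hypothetical, and the fetch function is injected so the example needs no network access.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(aids, fetch_one, max_workers=4):
    """Call fetch_one(aid) concurrently; log failures and continue."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its assay ID so errors can name the AID.
        futures = {executor.submit(fetch_one, aid): aid for aid in aids}
        for future in as_completed(futures):
            aid = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                logging.error("Failed to fetch AID %s: %s", aid, exc)
    # Completion order is nondeterministic, so sort for a stable result.
    return sorted(results, key=lambda row: row["AID"])
```

In a real run, `fetch_one` would perform the HTTP request for a single assay ID and return a dict (or DataFrame row) of its properties.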
Source code in chemgraphbuilder/node_properties_extractor.py
extract_compound_properties(main_data, start_chunk=0)
Extracts and aggregates compound properties from PubChem for a list of compounds associated with specific genes.
This method processes a CSV file specified by main_data, which contains gene identifiers and their associated compound IDs (CIDs). It selects compounds related to the top n most frequently occurring genes in the dataset, where n is determined by the instance's _enzyme_count attribute. The method then fetches detailed compound properties from PubChem in chunks, using concurrent requests to improve efficiency and manage the load on the PubChem API. The fetched compound properties are aggregated into a single DataFrame and saved to multiple CSV files, one for each chunk of compound IDs processed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the CSV file containing the main data (for example, 'Data/AllDataConnected.csv' as written by run()). | required |
Side Effects
- Saves the aggregated compound properties to CSV files in the current working directory. The files are named 'Data/Nodes/Compound_Properties/Chunk_{i}.csv', where {i} is the chunk index.
Returns:
Name | Type | Description |
---|---|---|
None | | This method does not return a value. Instead, it saves the fetched compound data directly to CSV files. |
Raises:
Type | Description |
---|---|
Exception | Logs an error and continues processing the next CID if |
Example
    extractor = NodePropertiesExtractor(['CYP2D6', 'CYP3A4'])
    extractor.create_data_directories()
    extractor.extract_compound_properties('Data/AllDataConnected.csv')

This will read 'Data/AllDataConnected.csv', filter for compounds associated with the top n genes, fetch their properties from PubChem, and save the results into multiple CSV files, one for each chunk of compounds processed.
Note
- Ensure that the 'main_data' CSV file exists and is accessible at the specified path.
- The method automatically handles NaN values in the 'CID' column and excludes them from processing.
- The _enzyme_count attribute determines the number of top genes for which compound properties will be fetched.
- Internet access is required to fetch compound data from the PubChem API.
- The method employs a ThreadPoolExecutor with a configurable number of workers (default is len(enzyme_list)) to parallelize requests, which can be adjusted based on system capabilities and API rate limits.
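The selection and chunking steps described above can be sketched without pandas. Everything in this example is an assumption made for illustration: the `GeneID` and `CID` keys, the use of `None` in place of NaN, and in particular the reading of `start_chunk` as "skip chunks before this index", which the documentation does not spell out.

```python
from collections import Counter


def top_gene_cids(rows, n, gene_key="GeneID", cid_key="CID"):
    """Return unique, non-missing CIDs belonging to the n most frequent genes."""
    counts = Counter(row[gene_key] for row in rows)
    top_genes = {gene for gene, _ in counts.most_common(n)}
    cids, seen = [], set()
    for row in rows:
        cid = row.get(cid_key)
        # None stands in for NaN here; missing CIDs are skipped.
        if row[gene_key] in top_genes and cid is not None and cid not in seen:
            seen.add(cid)
            cids.append(cid)
    return cids


def iter_chunks(items, size, start_chunk=0):
    """Yield (chunk_index, chunk) pairs, skipping chunks before start_chunk."""
    for index, start in enumerate(range(0, len(items), size)):
        if index >= start_chunk:
            yield index, items[start:start + size]
```

Each yielded index could then be used to name an output file such as 'Data/Nodes/Compound_Properties/Chunk_{i}.csv'.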
Source code in chemgraphbuilder/node_properties_extractor.py
extract_gene_properties(main_data)
Extracts and processes gene properties from a given data source, specifically targeting genes relevant to the study (e.g., CYP enzymes) and records their details in a structured DataFrame.
This method reads gene data from a CSV file specified by main_data, queries the PubChem database for additional properties of each unique gene ID found in the file, and compiles these properties into a new DataFrame. It focuses on fetching details like gene symbols, taxonomy, taxonomy IDs, and synonyms for each gene. The final DataFrame is filtered to include only genes of particular interest (e.g., certain CYP enzymes) and saved to a separate CSV file for further analysis or use.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the CSV file containing the main data (for example, 'Data/AllDataConnected.csv' as written by run()). | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | A DataFrame containing the compiled gene properties, including GeneID, Symbol, Taxonomy, Taxonomy ID, and Synonyms, filtered to include only specified genes of interest. This DataFrame is also saved to 'Data/Nodes/Gene_Properties.csv'. |
Raises:
Type | Description |
---|---|
Exception | If there's an issue reading the initial CSV file or |
Example
    extractor = NodePropertiesExtractor(['CYP2D6', 'CYP3A4'])
    extractor.create_data_directories()
    extractor.run()
    gene_properties_df = extractor.extract_gene_properties('Data/AllDataConnected.csv')
    print(gene_properties_df.head())
This would read gene IDs from 'Data/AllDataConnected.csv', fetch their properties from PubChem, and compile the details into a DataFrame, filtering for specified genes of interest and saving the results to 'Data/Nodes/Gene_Properties.csv'.
Note
The method filters the resulting DataFrame to include only genes with symbols in the predefined enzyme_list. Adjust this list as necessary to match the focus of your study or application.
Source code in chemgraphbuilder/node_properties_extractor.py
extract_protein_properties(main_data)
Extracts and compiles protein properties from the NCBI protein database based on accession numbers.
Given a CSV file specified by main_data, this method reads protein accession numbers and performs web scraping on the NCBI protein database pages to extract protein titles. The method constructs a URL for each accession number, sends a request to retrieve the page content, and parses the HTML to find the protein title. The extracted titles, along with their corresponding accession numbers and URLs, are compiled into a DataFrame. This DataFrame is saved to a CSV file, providing a structured summary of protein properties for further analysis or use.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the CSV file containing the main data (for example, 'Data/AllDataConnected.csv' as written by run()). | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | A DataFrame with columns 'RefSeq Accession', 'URL', and 'Description', where 'Description' contains the title of the protein extracted from its NCBI page. This DataFrame is saved to 'Data/Nodes/Protein_Properties.csv' in the current working directory. |
Raises:
Type | Description |
---|---|
Exception | If there's an issue reading the initial CSV file or |
Example
Assuming 'Data/AllDataConnected.csv' contains a column 'Target Accession' with accession numbers:

    extractor = NodePropertiesExtractor(['CYP2D6', 'CYP3A4'])
    extractor.create_data_directories()
    extractor.run()  # you need to run this only once
    protein_properties_df = extractor.extract_protein_properties('Data/AllDataConnected.csv')
    print(protein_properties_df.head())
This would read accession numbers from 'Data/AllDataConnected.csv', scrape their titles from the NCBI protein database, and compile the results into a DataFrame, which is also saved to 'Data/Nodes/Protein_Properties.csv'.
Note
This method requires internet access to query the NCBI protein database. Ensure the input file path is correct and accessible to avoid errors during processing. Web scraping is dependent on the structure of the web page; changes to the NCBI protein database pages may require updates to the scraping logic.
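To make the "build a URL, then parse the page title" step concrete, here is a small offline sketch using only the standard library. It is not the package's scraping logic: the URL pattern, the choice of the HTML `<title>` element as the source of the protein title, and the names `TitleParser` and `protein_record` are assumptions, and the accession and HTML in the usage note are made up.

```python
from html.parser import HTMLParser

# Assumed URL pattern for an NCBI protein record page.
NCBI_PROTEIN_URL = "https://www.ncbi.nlm.nih.gov/protein/{accession}"


class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def protein_record(accession, html):
    """Build one output row from an accession and its already-fetched HTML."""
    parser = TitleParser()
    parser.feed(html)
    return {
        "RefSeq Accession": accession,
        "URL": NCBI_PROTEIN_URL.format(accession=accession),
        "Description": parser.title.strip(),
    }
```

For instance, `protein_record("NP_EXAMPLE.1", "<title>Example protein</title>")` yields a row whose 'Description' is "Example protein". Keeping the HTTP request outside this function makes the parsing easy to check against saved pages when the site layout changes.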
Source code in chemgraphbuilder/node_properties_extractor.py
fetch_data(cid)
Retrieves detailed chemical compound properties for a specified Compound ID (CID) from the PubChem database.
This method constructs a query URL to fetch a wide range of properties for the given CID from PubChem, including molecular formula, molecular weight, canonical and isomeric SMILES, InChI codes, physicochemical properties, and more. If the CID is valid and data is available, it returns a pandas DataFrame containing these properties. This method also generates a URL to retrieve the structure image of the compound as a 2D PNG image, adding it as a column in the DataFrame. In cases where the CID is NaN or an error occurs during data retrieval, an empty DataFrame is returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cid | int or float | The Compound ID for which to fetch data. | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | A DataFrame containing the fetched properties for the given CID. The DataFrame includes columns for each property fetched from PubChem, along with a 'StructureImage2DURL' column containing the URL to the compound's structure image. Returns an empty DataFrame if the CID is NaN or if any error occurs during the fetch operation. |
Raises:
Type | Description |
---|---|
Exception | Logs an error message if the request to PubChem fails or |
Example
    extractor = NodePropertiesExtractor(['CYP2D6', 'CYP3A4'])
    extractor.create_data_directories()
    compound_data_df = extractor.fetch_data(2244)
    print(compound_data_df.head())
This example fetches the properties for the compound with CID 2244 from PubChem and prints the first few rows of the resulting DataFrame.
Note
This method requires an active internet connection to access the PubChem database. Ensure that the CID provided is valid and not NaN to avoid fetching errors. The structure and availability of data fields are subject to the current state of the PubChem database and may vary.
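The URL construction and NaN guard described for this method might look roughly like the sketch below. It only builds the request URLs and performs no network calls. The property list is a short illustrative subset, the helper names are hypothetical, and returning `None` for a missing CID is a simplification of the documented "empty DataFrame" behaviour.

```python
import math

# PUG REST compound endpoint; the property subset is chosen for illustration.
PUG_COMPOUND = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid"
PROPERTIES = ["MolecularFormula", "MolecularWeight", "InChI", "InChIKey"]


def is_missing(cid):
    """True for None or a float NaN, the cases that should be skipped."""
    return cid is None or (isinstance(cid, float) and math.isnan(cid))


def build_requests(cid, properties=PROPERTIES):
    """Return the property and 2D image URLs for a CID, or None if missing."""
    if is_missing(cid):
        return None
    cid = int(cid)  # CIDs read from CSV files often arrive as floats
    return {
        "properties": f"{PUG_COMPOUND}/{cid}/property/{','.join(properties)}/CSV",
        "image": f"{PUG_COMPOUND}/{cid}/PNG",
    }
```

For CID 2244 the image entry would be `https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/PNG`, which is the kind of value a 'StructureImage2DURL' column could hold.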
Source code in chemgraphbuilder/node_properties_extractor.py
get_enzyme_assays(enzyme)
Fetches assay data for a specified enzyme from the PubChem database and returns it as a pandas DataFrame.
This method constructs a URL to query the PubChem database for concise assay data related to the given enzyme. It processes the CSV response into a DataFrame, which includes various assay data points provided by PubChem.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
enzyme | str | The name of the enzyme for which assay data is | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | A DataFrame containing the assay data fetched from PubChem for the specified enzyme. The DataFrame includes columns based on the CSV response from PubChem, such as assay ID, results, and conditions. Returns None if no data is available or if an error occurs during data fetching or processing. |
Raises:
Type | Description |
---|---|
RequestException | If an error occurs during the HTTP |
EmptyDataError | If the response from PubChem contains no data. |
Example
    extractor = NodePropertiesExtractor(['enzyme'])
    enzyme_assays_df = extractor.get_enzyme_assays('enzyme')
    print(enzyme_assays_df.head())
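A hedged sketch of the two steps described for this method, building the query URL and turning a CSV response into rows, is shown below. The `/concise/CSV` suffix is an assumption inferred from the phrase "concise assay data", the helper names are hypothetical, and the sample CSV in the usage note is made up. The standard `csv` module stands in for pandas so the example has no dependencies.

```python
import csv
import io

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/genesymbol"


def assay_url(enzyme, base_url=BASE_URL):
    """Build the query URL for one enzyme (URL suffix is an assumption)."""
    return f"{base_url}/{enzyme}/concise/CSV"


def parse_assay_csv(text, sep=","):
    """Parse a CSV response into a list of dicts; return None if it is empty."""
    if not text.strip():
        return None
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))
```

Calling `parse_assay_csv("AID,CID\n1,2244\n")` gives a single row with string values, while an empty response gives `None`, mirroring the documented "returns None if no data is available" behaviour.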
Source code in chemgraphbuilder/node_properties_extractor.py
run()
Orchestrates the process of fetching, filtering, and aggregating assay data from PubChem for a predefined list of enzymes.
This method iteratively queries PubChem for assay data corresponding to each enzyme specified in the enzyme_list attribute during class initialization. It performs the following steps for each enzyme:
1. Constructs a query URL and fetches assay data from PubChem.
2. Filters the fetched data based on predefined criteria (e.g., containing specific substrings in the assay name).
3. Aggregates the filtered data into a single pandas DataFrame.
4. Identifies enzymes for which data could not be fetched or were excluded based on filtering criteria, logging their names.
The final aggregated DataFrame, containing assay data for all successfully processed enzymes, is then saved to a CSV file. This method facilitates the extraction and preprocessing of chemical assay data for further analysis or integration into knowledge graphs.
Note
- This method relies on the successful response from PubChem for each enzyme query.
- Enzymes with no available data or failing to meet the filtering criteria are excluded from the final DataFrame.
- The output CSV file is saved in the current working directory with the name 'Data/AllDataConnected.csv'.
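The four steps listed for this method can be sketched as a plain loop. This is an outline under stated assumptions, not the method's source: `run_pipeline`, the injected `fetch` and `keep_row` callables, and the 'Assay Name' key are hypothetical, rows are plain dicts instead of a pandas DataFrame, and the sketch returns the skipped enzymes where the documented method logs them.

```python
def run_pipeline(enzyme_list, fetch, keep_row):
    """Fetch, filter, and aggregate rows per enzyme; report skipped enzymes."""
    combined = []
    skipped = []
    for enzyme in enzyme_list:
        rows = fetch(enzyme)  # step 1: fetch (None signals a failed request)
        kept = [row for row in (rows or []) if keep_row(row, enzyme)]  # step 2
        if kept:
            combined.extend(kept)  # step 3: aggregate
        else:
            skipped.append(enzyme)  # step 4: record enzymes with no usable data
    return combined, skipped
```

A possible filter is `lambda row, enzyme: enzyme in row["Assay Name"]`, which keeps only assays whose name mentions the enzyme.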
Returns:
Type | Description |
---|---|
pd.DataFrame | A DataFrame containing the aggregated and filtered assay data for the specified enzymes. Columns in the DataFrame correspond to the assay data fields returned by PubChem, subject to the filtering criteria applied within this method. |
Raises:
Type | Description |
---|---|
RequestException | If there is an issue with fetching data |
Example
Assuming enzyme_list was set to ['CYP2D6', 'CYP3A4'] during class initialization:

    extractor = NodePropertiesExtractor(['CYP2D6', 'CYP3A4'])
    extractor.create_data_directories()
    result_df = extractor.run()
    print(result_df.head())
This will fetch and process assay data for 'CYP2D6' and 'CYP3A4', returning a DataFrame with the processed data.
Source code in chemgraphbuilder/node_properties_extractor.py
4. Node Data Processor
node_data_processor.py
This module provides the NodeDataProcessor class, which is responsible for preprocessing various types of node data (assays, proteins, genes, and compounds) for use in chemical knowledge graph construction. The preprocessing includes renaming columns, consolidating multiple files, and saving the processed data in a consistent format. This step ensures uniformity and ease of access for subsequent data analysis and integration processes.
Classes:
Name | Description |
---|---|
NodeDataProcessor | Handles preprocessing of assay, protein, gene, and compound data. |
Example Usage
    processor = NodeDataProcessor(data_dir='path/to/data')
    processor.preprocess_assays()
    processor.preprocess_proteins()
    processor.preprocess_genes()
    processor.preprocess_compounds()
NodeDataProcessor
NodeDataProcessor is responsible for preprocessing various types of node data (assays, proteins, genes, and compounds) by renaming columns, consolidating multiple files, and saving the processed data. This preprocessing step is crucial for ensuring uniformity and ease of access in subsequent analysis and integration processes.
Attributes:
Name | Type | Description |
---|---|---|
data_dir | str | The directory where the node data files are stored. |
Methods:
Name | Description |
---|---|
preprocess_assays | Processes and renames columns in assay data. |
preprocess_proteins | Processes and renames columns in protein data. |
preprocess_genes | Processes and renames columns in gene data. |
preprocess_compounds | Consolidates and renames columns in compound data. |
Source code in chemgraphbuilder/node_data_processor.py
__init__(data_dir)
Initializes the NodeDataProcessor with a directory path to manage the data files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_dir | str | The directory where the node data files are stored. | required |
Source code in chemgraphbuilder/node_data_processor.py
preprocess_assays()
Processes the assay data by renaming columns and saving the modified data back to disk. This method also handles visualization of assay data distributions if necessary.
Source code in chemgraphbuilder/node_data_processor.py
preprocess_compounds()
Concatenates multiple CSV files containing compound data into a single file, renames columns for uniformity, and saves the consolidated data. This method facilitates easier management and analysis of compound data.
Source code in chemgraphbuilder/node_data_processor.py
preprocess_genes()
Processes gene data by renaming columns and changing data types for specific fields. The processed data is saved for further use in gene-related analyses.
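Renaming columns and changing the data type of selected fields, as described for the preprocessing methods, can be illustrated with a small dependency-free helper. The function name, the `gene_id` target column name, and the sample row in the usage note are assumptions. The package itself works with pandas, where `DataFrame.rename` and `astype` would play the same roles.

```python
import csv
import io


def rename_and_cast(text, renames, casts):
    """Rename CSV columns, then convert selected (renamed) columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Columns missing from `renames` keep their original name.
        new_row = {renames.get(name, name): value for name, value in row.items()}
        for column, caster in casts.items():
            new_row[column] = caster(new_row[column])
        rows.append(new_row)
    return rows
```

As an illustrative call, `rename_and_cast("GeneID,Symbol\n1565,CYP2D6\n", {"GeneID": "gene_id"}, {"gene_id": int})` returns one row with an integer `gene_id` and an unchanged `Symbol` column.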
Source code in chemgraphbuilder/node_data_processor.py
preprocess_proteins()
Processes the protein data by renaming columns and saving the processed data. This method simplifies access to protein data for downstream analysis.
Source code in chemgraphbuilder/node_data_processor.py
5. Add Graph Nodes
Module for adding node data from CSV files to a Neo4j database.
This module provides a class and methods to read node data from CSV files and add them to a Neo4j database, including creating uniqueness constraints and generating Cypher queries.
AddGraphNodes
Bases: Neo4jBase
A class used to add node data from a CSV file or a directory of CSV files to a Neo4j database.
Methods:
- create_uniqueness_constraint(driver, label, unique_property): Create a uniqueness constraint for the unique property of nodes in Neo4j.
- generate_cypher_queries(node_dict, label, unique_property): Generate Cypher queries to update nodes in Neo4j based on the data from the CSV file.
- execute_queries(queries): Execute a list of provided Cypher queries against the Neo4j database.
- read_csv_file(file_path, unique_property): Read data from a CSV file and extract node properties.
- combine_csv_files(input_directory): Combine multiple CSV files with the same columns into a single DataFrame.
- process_and_add_nodes(file_path, label, unique_property): Process the CSV file and add node data to the Neo4j database.
- process_and_add_nodes_from_directory(directory_path, label, unique_property): Combine CSV files from a directory and add node data to the Neo4j database.
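One way to generate node-upsert queries from a dictionary of node properties is sketched below. It is an illustration, not the class's actual query format: the function name, the use of query parameters, the `MERGE ... SET n += $properties` form, and the assumption that `node_dict` maps each unique value to a dict of properties are all choices made for this example.

```python
def generate_merge_queries(node_dict, label, unique_property):
    """Yield (query, parameters) pairs that upsert one node per dict entry."""
    # Labels and property keys cannot be passed as Cypher parameters, so they
    # are interpolated here and must come from trusted input.
    query = (
        f"MERGE (n:{label} {{{unique_property}: $unique_value}}) "
        "SET n += $properties"
    )
    for unique_value, properties in node_dict.items():
        yield query, {"unique_value": unique_value, "properties": properties}
```

Each pair could then be passed to a session as `session.run(query, parameters)`. Sending values as parameters rather than formatting them into the query text avoids quoting problems with names that contain apostrophes or backslashes.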
Source code in chemgraphbuilder/add_graph_nodes.py
__init__(driver)
Initializes the AddGraphNodes class with a Neo4j driver.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
driver | neo4j.GraphDatabase.driver | A driver instance to connect to the Neo4j database. | required |
Source code in chemgraphbuilder/add_graph_nodes.py
combine_csv_files(input_directory)
Combine multiple CSV files with the same columns into a single DataFrame.
Parameters:
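Name | Type | Description | Default |
---|---|---|---|
input_directory | str | The directory containing the CSV files to be combined. | required |
Returns:
Type | Description |
---|---|
DataFrame | A combined DataFrame containing data from all the CSV files. |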
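The combining step can be illustrated without pandas. The following is a minimal sketch, not the library's implementation: `combine_csv_rows` is a hypothetical helper that returns a list of row dictionaries instead of a DataFrame, and it assumes every file in the directory shares the same header.

```python
import csv
import glob
import os


def combine_csv_rows(input_directory):
    """Read every *.csv file in a directory and return all rows as one list.

    Hypothetical stand-in for a DataFrame-based combine step. Raises
    ValueError if a file's header differs from the first file's header.
    """
    combined = []
    expected_header = None
    # Sort the paths so the combined order is deterministic.
    for path in sorted(glob.glob(os.path.join(input_directory, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if expected_header is None:
                expected_header = reader.fieldnames
            elif reader.fieldnames != expected_header:
                raise ValueError(f"Unexpected columns in {path}")
            combined.extend(reader)
    return combined
```

Checking the header up front mirrors the "same columns" requirement stated above, so a mismatched file fails loudly instead of producing misaligned rows.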
Source code in chemgraphbuilder/add_graph_nodes.py
create_uniqueness_constraint(driver, label, unique_property)
staticmethod
Create a uniqueness constraint for the unique property of nodes in Neo4j.
Parameters:
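Name | Type | Description | Default |
---|---|---|---|
driver | neo4j.GraphDatabase.driver | A driver instance to connect to the Neo4j database. | required |
label | str | The label of the node. | required |
unique_property | str | The unique property of the node. | required |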
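For orientation, a uniqueness constraint in recent Neo4j releases is declared with a `CREATE CONSTRAINT ... REQUIRE ... IS UNIQUE` statement (older releases used an `ON ... ASSERT` form). The sketch below only builds such a statement as a string; `build_uniqueness_constraint` is a hypothetical helper, and the exact statement the library sends is not shown in this page.

```python
import re


def build_uniqueness_constraint(label, unique_property):
    """Return a Cypher statement making `unique_property` unique for `label`.

    Hypothetical helper for illustration only.
    """
    # Labels and property names cannot be passed as query parameters,
    # so validate them before interpolating them into the statement.
    for name in (label, unique_property):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    return (
        f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{label}) "
        f"REQUIRE n.{unique_property} IS UNIQUE"
    )
```

The resulting string could then be run inside a driver session; `IF NOT EXISTS` keeps repeated runs from failing when the constraint is already present.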
Source code in chemgraphbuilder/add_graph_nodes.py
execute_queries(queries)
Execute the provided list of Cypher queries against the Neo4j database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
queries | list | A list of Cypher query strings to execute. | required |
Source code in chemgraphbuilder/add_graph_nodes.py
generate_cypher_queries(node_dict, label, unique_property)
Generate Cypher queries for updating Neo4j based on the provided node data dictionary.
Parameters:
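Name | Type | Description | Default |
---|---|---|---|
node_dict | dict | A dictionary with unique identifiers as keys and node data as values. | required |
label | str | The label of the node. | required |
unique_property | str | The unique property of the node. | required |
Yields:
Type | Description |
---|---|
str | A Cypher query string. |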
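A generator of this kind might look like the sketch below. It is an assumption about the general shape (one `MERGE ... SET` statement per node), not the library's actual query text; `format_value` and `generate_merge_queries` are hypothetical names.

```python
import json


def format_value(value):
    """Render a Python value as a Cypher literal (hypothetical helper)."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # json.dumps adds double quotes and escapes embedded quotes and backslashes.
    return json.dumps(str(value))


def generate_merge_queries(node_dict, label, unique_property):
    """Yield one MERGE ... SET query per entry in `node_dict`."""
    for unique_id, properties in node_dict.items():
        assignments = ", ".join(
            f"n.{key} = {format_value(val)}" for key, val in properties.items()
        )
        query = f"MERGE (n:{label} {{{unique_property}: {format_value(unique_id)}}})"
        if assignments:
            query += f" SET {assignments}"
        yield query
```

Yielding queries one at a time keeps memory use flat for large CSV files. Inlining literals is done here only to make the output readable; passing values as query parameters is generally the safer choice.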
Source code in chemgraphbuilder/add_graph_nodes.py
process_and_add_nodes(file_path, label, unique_property)
Process the CSV file and add node data to the Neo4j database.
Parameters:
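Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the CSV file. | required |
label | str | The label of the node. | required |
unique_property | str | The unique property of the node. | required |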
Source code in chemgraphbuilder/add_graph_nodes.py
process_and_add_nodes_from_directory(directory_path, label, unique_property)
Combine CSV files from a directory and add node data to the Neo4j database.
Parameters:
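Name | Type | Description | Default |
---|---|---|---|
directory_path | str | The path to the directory containing the CSV files. | required |
label | str | The label of the node. | required |
unique_property | str | The unique property of the node. | required |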
Source code in chemgraphbuilder/add_graph_nodes.py
public_generate_property_string(value)
Public method to access the protected _generate_property_string method for testing.
Parameters:
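Name | Type | Description | Default |
---|---|---|---|
value | Any | The value to be formatted. | required |
Returns:
Type | Description |
---|---|
str | The formatted property string. |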
Source code in chemgraphbuilder/add_graph_nodes.py
read_csv_file(file_path, unique_property)
Read data from a CSV file and extract node properties.
Parameters:
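Name | Type | Description | Default |
---|---|---|---|
file_path | str | The path to the CSV file. | required |
unique_property | str | The column name that serves as the unique identifier for the nodes. | required |
Returns:
Type | Description |
---|---|
dict | A dictionary with unique identifiers as keys and extracted data as values. |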
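The returned mapping can be pictured with a small standard-library sketch. `read_nodes` is a hypothetical stand-in, and skipping empty cells is an assumption made here for illustration rather than documented behavior.

```python
import csv


def read_nodes(file_path, unique_property):
    """Map each row's unique identifier to its remaining column values."""
    nodes = {}
    with open(file_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            key = row.pop(unique_property)  # KeyError if the column is missing
            # Skip empty cells so they are not stored as empty properties.
            nodes[key] = {
                name: value
                for name, value in row.items()
                if value not in ("", None)
            }
    return nodes
```

A dictionary of this shape is what a query generator can iterate over: the key identifies the node and the value holds the properties to set.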
Source code in chemgraphbuilder/add_graph_nodes.py
6. Relationship Properties Extractor
This module defines the RelationshipPropertiesExtractor class, which is responsible for extracting and analyzing relationship properties among compounds, genes, and assays from the PubChem database.
The class facilitates the retrieval of complex relational data between chemical entities, enabling detailed analysis of biochemical interactions and properties. The extracted data is ideal for constructing knowledge graphs, supporting drug discovery, and understanding genetic influences on compound behavior.
Classes:
Name | Description |
---|---|
RelationshipPropertiesExtractor | A class to extract and analyze relationship properties from PubChem. |
Usage Example
    extractor = RelationshipPropertiesExtractor()
    extractor.assay_compound_relationship("Data/AllDataCollected.csv")
This example fetches assay-compound relationship data for specified assays and saves the data to CSV files.
Note
Ensure network access to the PubChem API for data retrieval.
RelationshipPropertiesExtractor
Extracts and analyzes relationship properties among compounds, genes, and assays from the PubChem database.
This class facilitates the retrieval of complex relational data between chemical entities, enabling detailed analysis of biochemical interactions and properties. The extracted data is ideal for constructing knowledge graphs, supporting drug discovery, and understanding genetic influences on compound behavior.
Methods within the class are tailored to query specific relationship types from PubChem, including compound-assay relationships, compound co-occurrences, and compound transformations influenced by genes. Data fetched from PubChem is processed and saved in structured formats (CSV files), ready for further analysis or database integration.
Attributes:
Name | Type | Description |
---|---|---|
session | Session | Session object to persist certain parameters |
Usage
    extractor = RelationshipPropertiesExtractor()
    extractor.assay_compound_relationship("Data/AllDataCollected.csv")
This example fetches assay-compound relationship data for specified assays and saves the data to CSV files.
Source code in chemgraphbuilder/relationship_properties_extractor.py
__init__()
Initializes a RelationshipPropertiesExtractor with a Requests session for efficient network calls.
Source code in chemgraphbuilder/relationship_properties_extractor.py
assay_compound_relationship(assays_data, start_chunk=0)
Processes and stores relationships between assays and compounds based on assay data from PubChem.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
assays_data | str | Path to a CSV file containing assay IDs (AIDs). | required |
start_chunk | int | The starting index for processing chunks. | 0 |
Source code in chemgraphbuilder/relationship_properties_extractor.py
assay_gene_relationship(main_data)
Extracts and saves relationships between assays and proteins from the specified dataset.
This method processes assay data to identify relationships between assays and their target proteins. It selects relevant columns from the input data, removes duplicates to ensure unique relationships, and saves the cleaned data to a CSV file for further analysis or integration into knowledge graphs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the CSV file containing the main data. The | required |
Returns:
Type | Description |
---|---|
pandas.DataFrame | A DataFrame containing the unique relationships between assays and proteins, including the assay ID, target gene ID, and activity name. |
Side Effects
- Writes a CSV file to 'Data/Relationships/Assay_Gene_Relationship.csv', containing the processed relationships data.
Source code in chemgraphbuilder/relationship_properties_extractor.py
compound_compound_cooccurrence(main_data, rate_limit=5)
Analyzes compound-compound co-occurrence relationships from the specified main data file and saves the results into structured CSV files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the main data file. | required |
rate_limit | int | The maximum number of requests per second. | 5 |
Returns:
Type | Description |
---|---|
str | A message indicating the completion of data fetching and saving. |
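The `rate_limit` parameter caps requests per second. One common way to enforce such a cap is to space calls evenly, as in the sketch below; `RateLimiter` is a hypothetical class written for illustration, and the library may throttle differently. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class RateLimiter:
    """Allow at most `rate_limit` calls per second by spacing them evenly."""

    def __init__(self, rate_limit=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_limit
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `limiter.wait()` immediately before each HTTP request keeps a loop at or below five requests per second with the default setting.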
Source code in chemgraphbuilder/relationship_properties_extractor.py
compound_gene_cooccurrence(gene_data, rate_limit=5)
Analyzes compound-gene co-occurrence relationships from the specified gene data file and saves the results into structured CSV files.
Source code in chemgraphbuilder/relationship_properties_extractor.py
compound_gene_interaction(gene_data, rate_limit=5)
Analyzes compound-gene interaction relationships from the specified gene data file and saves the results into structured CSV files.
Source code in chemgraphbuilder/relationship_properties_extractor.py
compound_gene_relationship(main_data)
Identifies and records relationships between compounds and proteins from the input data.
This method focuses on extracting compound-protein interaction data, including activity outcomes and values. It selects pertinent columns, removes duplicate records, and sorts the data by Compound ID and Target Accession for clarity. The cleaned dataset is then saved to a CSV file, providing a structured view of how compounds interact with various proteins, which can be critical for drug discovery and pharmacological research.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the CSV file with compound and protein data. | required |
Returns:
Type | Description |
---|---|
pandas.DataFrame | A DataFrame with processed compound-protein relationships, sorted and cleaned for direct analysis or database insertion. |
Side Effects
- Saves the processed relationships data to 'Data/Relationships/Compound_Gene_Relationship.csv', facilitating easy access and integration.
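The select, de-duplicate, and sort steps described above can be sketched with plain Python data structures. `unique_sorted_relationships` is a hypothetical helper, and the column names in the usage note are illustrative assumptions rather than the library's actual schema.

```python
def unique_sorted_relationships(rows, columns, sort_keys):
    """Project rows onto `columns`, drop duplicates, and sort by `sort_keys`."""
    seen = set()
    unique_rows = []
    for row in rows:
        projected = tuple(row[column] for column in columns)
        # Keep only the first occurrence of each projected record.
        if projected not in seen:
            seen.add(projected)
            unique_rows.append(dict(zip(columns, projected)))
    return sorted(unique_rows, key=lambda r: tuple(r[key] for key in sort_keys))
```

For example, calling it with `columns=["CID", "Target Accession", "Activity Outcome"]` and `sort_keys=["CID", "Target Accession"]` yields one record per distinct combination, ordered by compound and then by target.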
Source code in chemgraphbuilder/relationship_properties_extractor.py
compound_similarity_relationship(main_data, start_chunk=0)
Identifies and records the similarity relationships between compounds based on a list of CIDs. The similarity is determined by the Tanimoto similarity coefficient with a threshold of 95% to ensure high structural similarity.
This method reads a CSV file containing compound data, filters compounds based on specific 'Target GeneID' values, and fetches similar CIDs for each compound. The compounds are processed in chunks to manage memory usage and improve efficiency. The results are saved into separate CSV files for each chunk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the CSV file containing the main compound data. | required |
start_chunk | int | The starting index for processing chunks. | 0 |
Note
- The method filters the main data for compounds associated with specific 'Target GeneID' values before fetching similar CIDs, optimizing the process for relevant compounds only.
- The division of CIDs into chunks and concurrent processing helps in managing large datasets and utilizes parallelism for faster execution.
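The chunking and the role of `start_chunk` can be illustrated as follows. This is a sketch under the assumption that `start_chunk` simply skips chunks that were already processed; `iter_chunks` and `chunk_size` are hypothetical names, and the chunk size the library uses is not stated here.

```python
def iter_chunks(items, chunk_size, start_chunk=0):
    """Yield (chunk_index, chunk) pairs, skipping chunks before `start_chunk`."""
    for index, start in enumerate(range(0, len(items), chunk_size)):
        if index >= start_chunk:
            yield index, items[start:start + chunk_size]
```

Keeping the chunk index in the output makes it straightforward to name one result file per chunk and to resume an interrupted run from the last completed index.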
Source code in chemgraphbuilder/relationship_properties_extractor.py
compound_transformation(gene_properties)
Analyzes compound transformation data based on gene properties, focusing on metabolic transformations involving specified genes. This method queries the PubChem database for transformation data related to compounds associated with the genes identified in the provided CSV file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
gene_properties | str | Path to the CSV file containing gene properties | required |
Processing Steps
- Reads the provided CSV file to extract unique gene identifiers.
- For each gene identifier, constructs a query to fetch relevant compound transformation data from PubChem, focusing on metabolic transformations where the gene plays a role.
- Processes and aggregates the fetched data into a structured pandas DataFrame.
- Filters the aggregated data to retain specific columns relevant to compound transformations, including substrate and metabolite Compound IDs (CIDs), the type of metabolic conversion, gene identifiers, PubMed IDs, and DOIs for related publications.
- Saves the aggregated and filtered DataFrame to a CSV file for further analysis or integration into knowledge graphs or other data models.
Returns:
Type | Description |
---|---|
pandas.DataFrame | A DataFrame containing processed compound transformation data, including substrate and metabolite CIDs, metabolic conversion types, gene identifiers, PubMed IDs, and DOIs. The DataFrame structure facilitates further analysis or use in constructing detailed views of metabolic pathways involving the specified genes. |
Side Effects
- Saves the aggregated compound transformation data to 'Data/Relationships/Compound_Transformation.csv' in the current working directory. This file captures the relationship between substrates, metabolites, and genes based on the input gene properties.
Raises:
Type | Description |
---|---|
FileNotFoundError | If the specified 'gene_properties' file does not |
ValueError | If 'gene_properties' does not contain the required |
Example
    extractor = RelationshipPropertiesExtractor()
    transformation_df = extractor.compound_transformation('Data/Nodes/gene_properties.csv')
    print(transformation_df.head())
This example processes gene properties from 'Data/Nodes/gene_properties.csv', queries PubChem for compound transformation data related to the genes, and compiles the results into a DataFrame.
Note
The method assumes that the input 'gene_properties' file is accessible and correctly formatted. The availability and structure of the PubChem database may affect the completeness and accuracy of the fetched transformation data. Users should verify the existence of the 'Data/Relationships' directory and have appropriate permissions to write files to it.
Source code in chemgraphbuilder/relationship_properties_extractor.py
fetch_data_for_aid(aid, columns_to_remove)
Fetches and processes assay data for a specified Assay ID (AID) from the PubChem database, preparing it for analysis or further processing.
This method queries the PubChem database for assay data associated with a given AID. It constructs the query URL, sends the request using a previously established session, and processes the response. The response is expected to be in CSV format, which this method reads into a pandas DataFrame. Specific columns can be removed from this DataFrame based on the requirements for analysis. This allows for the customization of the fetched data, making it easier to work with specific datasets.
If the request is successful and the data is fetched without issues, it undergoes initial processing to remove unwanted columns as specified by the 'columns_to_remove' parameter. In case of an error during the data fetching or processing (e.g., issues with parsing the CSV data), appropriate error messages are logged, and an empty DataFrame is returned as a fallback.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aid | int | The assay ID for which data is to be fetched. This ID is | required |
columns_to_remove | list of str | A list of column names that should | required |
Returns:
Type | Description |
---|---|
pandas.DataFrame | A DataFrame containing the processed data associated with the given AID. The DataFrame will exclude columns listed in 'columns_to_remove'. If the data fetching fails or if an error occurs during processing, an empty DataFrame is returned. |
Raises:
Type | Description |
---|---|
RequestException | If an error occurs during the HTTP request |
ParserError | If an error occurs while parsing the CSV |
Example
    extractor = RelationshipPropertiesExtractor()
    processed_data_df = extractor.fetch_data_for_aid(12345, ['UnwantedColumn1', 'UnwantedColumn2'])
    print(processed_data_df.head())
This example demonstrates how to fetch and process assay data for the assay with ID 12345, removing 'UnwantedColumn1' and 'UnwantedColumn2' from the resulting DataFrame. The first few rows of the processed DataFrame are printed as output.
Note
- This method is part of a class that requires a valid session with the PubChem API. Ensure that the class is properly initialized and that the session is active.
- The removal of columns is an optional step and can be customized based on the analysis needs. If no columns need to be removed, pass an empty list as 'columns_to_remove'.
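The parse-then-prune behavior, including the empty fallback on a parse error, can be sketched without network access or pandas. `parse_assay_csv` is a hypothetical helper that works on CSV text already in memory and returns a list of row dictionaries in place of a DataFrame.

```python
import csv
import io


def parse_assay_csv(csv_text, columns_to_remove):
    """Parse CSV text into row dicts, dropping the named columns.

    Returns an empty list when the text cannot be parsed.
    """
    try:
        # strict=True makes the csv module raise csv.Error on malformed input.
        rows = list(csv.DictReader(io.StringIO(csv_text), strict=True))
    except csv.Error:
        return []
    unwanted = set(columns_to_remove)
    return [
        {name: value for name, value in row.items() if name not in unwanted}
        for row in rows
    ]
```

Names in `columns_to_remove` that do not appear in the data are simply ignored, so passing an empty list or a superset of the real columns is safe.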
Source code in chemgraphbuilder/relationship_properties_extractor.py
fetch_similar_cids(cid)
Fetches similar compound IDs (CIDs) from the PubChem database for a given compound ID (CID) using 2D similarity.
This method queries the PubChem database to find compounds that are similar to the given CID based on 2D structural similarity. The similarity threshold is set to 95%, and a maximum of 100 similar CIDs are fetched. The response is parsed from XML format to extract the similar CIDs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cid | int | The compound ID for which similar CIDs are to be fetched. | required |
Returns:
Type | Description |
---|---|
tuple | A tuple containing the original CID and a list of similar CIDs. If an error occurs, the list of similar CIDs will be empty. |
Raises:
Type | Description |
---|---|
Exception | Logs an error message with the original CID and the |
Note
- The method utilizes the requests library for HTTP requests and xml.etree.ElementTree for XML parsing.
- In case of a request failure or parsing error, the method logs the error and returns the original CID with an empty list, allowing the calling function to handle the exception as needed.
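The XML parsing step might look like the sketch below. It assumes the response is a list of `CID` elements, possibly inside an XML namespace; the sample document in the usage note is an assumed shape, not a captured PubChem response, and `extract_cids` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def extract_cids(xml_text):
    """Return the integer CIDs found in an XML identifier list.

    Returns an empty list if the text is not well-formed XML.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    cids = []
    for element in root.iter():
        # Strip a "{namespace}" prefix so the match works with or without one.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag == "CID" and element.text:
            cids.append(int(element.text.strip()))
    return cids
```

Given `<IdentifierList xmlns="urn:example"><CID>2244</CID><CID>3672</CID></IdentifierList>`, the function returns `[2244, 3672]`. Returning an empty list on malformed input matches the fallback described in the note above.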
Source code in chemgraphbuilder/relationship_properties_extractor.py
gene_protein_relationship(main_data)
Extracts and saves relationships between genes and proteins based on the provided dataset.
This method selects relevant columns to highlight the relationships between genes and their corresponding proteins. It removes duplicate entries to ensure that each relationship is represented uniquely and saves the resultant data to a CSV file. This facilitates easy integration of genetic data into knowledge bases or further analysis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
main_data | str | Path to the CSV file containing gene and protein data. | required |
Returns:
Type | Description |
---|---|
pandas.DataFrame | A DataFrame of unique gene-protein relationships, including gene ID and protein accession numbers. |
Side Effects
- Writes the processed data to 'Data/Gene_Protein_Relationship.csv' in a structured CSV format.
Source code in chemgraphbuilder/relationship_properties_extractor.py
process_chunk(chunk)
Processes a chunk of CIDs in parallel to fetch similar CIDs for each CID in the chunk.
This method uses a ThreadPoolExecutor to send out concurrent requests for fetching similar CIDs for a list of CIDs. The number of worker threads is set to 5. Each CID's request is handled by the fetch_similar_cids method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk | list of int | A list of compound IDs (CIDs) to process in | required |
Returns:
Type | Description |
---|---|
list of tuples | A list of tuples, each containing a CID and its corresponding list of similar CIDs. |
Side Effects
- Utilizes concurrent threads to speed up the fetching process.
- May log errors if any occur during the fetching of similar CIDs for individual CIDs.
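The thread-pool pattern described above can be sketched with a stand-in fetch function so that it runs offline. `fetch_stub` is hypothetical and replaces the real network call; whether the library collects results with `executor.map` or another mechanism is not shown in this page.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_stub(cid):
    """Stand-in for a network call; returns (cid, similar_cids)."""
    return cid, [cid + 1, cid + 2]


def process_chunk(chunk, fetch=fetch_stub, max_workers=5):
    """Run `fetch` for every CID in `chunk` on a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in input order.
        return list(executor.map(fetch, chunk))
```

Threads suit this workload because each task spends most of its time waiting on I/O, and a small pool of five workers keeps the request rate modest.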
Source code in chemgraphbuilder/relationship_properties_extractor.py
7. Relationship Data Processor
RelationshipDataProcessor
A class to process relationship data files, filtering and augmenting the data.
Attributes:
Name | Type | Description |
---|---|---|
path | str | The directory path where the data files are stored. |
csv_files | list | List of CSV files matching the pattern 'AID_*.csv'. |
all_data_connected | dict | A dictionary containing additional data connected to assays. |
Source code in chemgraphbuilder/relationship_data_processor.py
__init__(path, start_chunk=0)
Initializes the RelationshipDataProcessor with the specified path and start chunk index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The directory path containing the CSV files. | required |
start_chunk | int | The starting index for processing chunks. | 0 |
Source code in chemgraphbuilder/relationship_data_processor.py
most_frequent(row)
staticmethod
Finds the most frequent value in a row, excluding NaN values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
row | Series | A row from a DataFrame. | required |
Returns:
Type | Description |
---|---|
str | The most frequent value in the row. |
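A plain-Python sketch of the same idea, taking any iterable of values in place of a pandas Series, is shown below. Returning `None` when every value is missing is an assumption made for this sketch; the library's behavior in that case is not documented here.

```python
import math
from collections import Counter


def most_frequent(values):
    """Return the most common non-missing value, or None if there is none."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not present:
        return None
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(present).most_common(1)[0][0]
```

For instance, `most_frequent(["Active", float("nan"), "Inactive", "Active"])` returns `"Active"`.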
Source code in chemgraphbuilder/relationship_data_processor.py
process_files()
Processes the CSV files by filtering, cleaning, and augmenting data.
The processed data is saved to output files.
Source code in chemgraphbuilder/relationship_data_processor.py
propagate_phenotype(group)
staticmethod
Propagates the phenotype information within a group.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
group | DataFrame | A DataFrame group. | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | The updated group with propagated phenotype information. |
Source code in chemgraphbuilder/relationship_data_processor.py
8. Add Graph Relationships