Have you ever wondered how NASA identifies its leading experts, assembles high-performing teams, and plans for the skill sets of the future? The answer lies in their innovative initiative known as the People Knowledge Graph!

This groundbreaking project is reshaping the landscape of people analytics at NASA, leveraging the capabilities of graph databases combined with the power of large language models (LLMs).

If you missed out on the recent community call, don't worry! You can catch the full NASA x Memgraph session on demand, where you can take a closer look at their architecture, enjoy a live demonstration, and engage in an expert Q&A.

Introduction

This engaging community call featured key members of NASA's People Analytics team, who are the driving force behind the development of the People Knowledge Graph:

  • David Meza, Branch Chief of People Analytics and Head of Human Capital Analytics at NASA
  • Madison Ostermann, Data Scientist and Data Engineer
  • Katherine Knott, Data Scientist

During the session, they shared insights into how they integrated graph databases, LLMs, and secure Amazon Web Services (AWS) infrastructure to effectively connect people, projects, and the skills prevalent across the agency.

The outcome is an advanced graph-powered system that facilitates the discovery of subject matter experts, conducts project similarity analyses, and generates real-time organizational insights. All of this information is readily accessible through Cypher queries and a user-friendly chatbot interface powered by GraphRAG.

Key Technical Takeaways from the Live Demo

1. The Necessity of a People Knowledge Graph

People data stored within traditional relational databases often appears disorganized and unwieldy, characterized by rows, columns, and joins that fail to effectively represent the intricate relationships found in large organizations like NASA.

Graph databases, on the other hand, present a solution by establishing intuitive connections between individuals, their respective skills, project involvement, and even career trajectories. This allows NASA to swiftly answer essential questions such as:

  • Who has worked on highly similar AI projects across various NASA centers?
  • Which employees possess cross-disciplinary expertise in AI/ML?
  • Where are the skill gaps that we need to address?

NASA employs Memgraph to manage all of this data in real time, enabling their team to query and explore multi-hop relationships fluidly.
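A multi-hop expert-finding query of this kind can be sketched as a parameterized Cypher statement built in Python. The labels and relationship names below (Employee, Skill, Project, HAS_SKILL, WORKED_ON) are illustrative assumptions, not NASA's actual schema:

```python
def experts_for_skill(skill: str) -> tuple[str, dict]:
    """Build a parameterized multi-hop Cypher query that finds employees
    holding a given skill, along with the projects they worked on.
    Labels and relationship names are hypothetical."""
    query = (
        "MATCH (e:Employee)-[:HAS_SKILL]->(s:Skill {name: $skill}) "
        "OPTIONAL MATCH (e)-[:WORKED_ON]->(p:Project) "
        "RETURN e.name AS employee, collect(p.title) AS projects"
    )
    return query, {"skill": skill}

query, params = experts_for_skill("machine learning")
```

Passing values as Cypher parameters (rather than string interpolation) keeps queries cacheable and safe against injection.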

2. Infrastructure and Deployment

The complete system operates on NASA's secure internal AWS cloud, with several key components:

  • Memgraph running in Docker on EC2 instances.
  • A self-hosted LLM server (Ollama), also deployed on EC2, used for skill extraction and chatbot functionality.
  • AWS S3 buckets utilized for storing both structured and unstructured data.
  • GQLAlchemy used for ingesting data from S3 into Memgraph via Cypher.
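The S3-to-Memgraph ingestion step can be sketched as turning each record into a parameterized Cypher MERGE; in the real pipeline GQLAlchemy executes such statements against Memgraph, while the sketch below only builds them, and the column and label names are assumptions:

```python
def personnel_merge(row: dict) -> tuple[str, dict]:
    """Turn one personnel record (e.g. a row read from an S3 extract)
    into a parameterized Cypher MERGE.  Column names (id, name, center)
    and labels are illustrative assumptions about the warehouse schema."""
    query = (
        "MERGE (e:Employee {id: $id}) "
        "SET e.name = $name "
        "MERGE (c:Center {name: $center}) "
        "MERGE (e)-[:BASED_AT]->(c)"
    )
    return query, {"id": row["id"], "name": row["name"], "center": row["center"]}

query, params = personnel_merge({"id": "E-001", "name": "A. Doe", "center": "JSC"})
```

With GQLAlchemy, each pair would then be run roughly as `Memgraph().execute(query, params)`; MERGE keeps the load idempotent, so re-ingesting the same rows doesn't duplicate nodes.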

The enterprise license from Memgraph allows NASA to segment data across multiple databases, ensuring that personally identifiable information (PII) remains secure and isolated.

3. Data Ingestion and Skill Extraction

The team efficiently aggregated data from various sources into Memgraph, including:

  • Personnel Data from NASAs internal Personnel Data Warehouse
  • AI/ML Project Data from the AI Use Case Registry
  • Extracted Skills from Team Resumes

For AI/ML project data, the team computed cosine similarity between project descriptions to establish relationships, with similarity metrics serving as properties. Additionally, resume data underwent processing using Ollama to extract skills automatically, negating the need for manually tagged datasets. These skills were subsequently linked to employees as nodes within the graph.
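The similarity-linking step can be sketched with toy vectors. Real embeddings would come from an embedding model, and the 0.95 cutoff below is an assumed threshold, not the value NASA uses:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy description embeddings; real ones would come from an embedding model.
projects = {
    "rover-vision": [0.9, 0.1, 0.2],
    "lander-vision": [0.85, 0.15, 0.25],
    "payroll-etl": [0.05, 0.9, 0.1],
}

THRESHOLD = 0.95  # assumed cutoff; the talk doesn't state the exact value

# Pairwise compare projects; keep the score as a relationship property.
edges = []
names = list(projects)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = cosine(projects[a], projects[b])
        if sim >= THRESHOLD:
            edges.append((a, "SIMILAR_TO", b, round(sim, 3)))
```

Only the two vision projects clear the threshold here, so only they would be connected by a SIMILAR_TO relationship carrying the score as a property.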

4. Graph Schema and Modeling

NASA constructed a labeled property graph containing nodes that encapsulate:

  • Employees
  • Position Titles
  • Occupation Series
  • Pay Grades
  • Organizations (Mission Support Enterprise Organization - MISO)
  • Centers
  • Projects, with associated descriptions represented as node properties (including unstructured text)
  • Levels of Education
  • Universities attended
  • Instructional Program Majors
  • Extracted Skills

All nodes were classified under the label "Entity" to facilitate vector indexing and support GraphRAG (Graph Retrieval-Augmented Generation).
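The kind of lookup that the shared "Entity" label enables can be illustrated in miniature: every node carries an embedding, and a query vector is matched against all of them by cosine similarity. Memgraph's vector index performs this search natively; the toy below, with hypothetical node names, just shows the idea:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy "Entity" nodes with embedding properties (names are hypothetical).
entities = {
    "skill:python": [0.9, 0.1],
    "skill:welding": [0.1, 0.9],
    "project:ml-pipeline": [0.8, 0.2],
}

def nearest_entities(query_vec, k=2):
    """Rank all Entity nodes by cosine similarity to the query vector."""
    ranked = sorted(entities, key=lambda n: cosine(query_vec, entities[n]), reverse=True)
    return ranked[:k]
```

Because every node type shares the "Entity" label, a single vector index covers skills, projects, and people alike, which is what GraphRAG retrieval relies on.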

5. Highlights from the Live People Graph Demo

During the community call, the team demonstrated real Cypher queries on a sample dataset (with anonymized PII), enabling them to address various types of inquiries relevant to NASA:

  • Subject Matter Expert Identification: Tailored to pinpoint employees with expertise in specific domains or mission-critical capabilities.
  • Leadership Report-Out Descriptive Queries: Crafted to provide leadership with high-level metrics by analyzing workforce composition, capability distribution, and organizational dynamics.
  • Project Overlap Analysis: Engineered to identify near-duplicate projects based on similarity scores.

Moreover, they provided a sneak peek into a RAG-based chatbot that allows users to query the graph using natural language.

6. LLM-Powered RAG Pipeline

The RAG-based chatbot NASA built on top of the graph works as follows:

  • The LLM extracts key information from user inquiries.
  • A Modified Pivot Search is executed on each piece of key information independently, returning multiple pertinent nodes.
  • Relevance expansion occurs from these multiple relevant nodes, beginning at each relevant node and traversing a desired number of hops. The result consists of the start node, end node, and relationship details, referred to as "context triplets."
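The relevance-expansion step above can be sketched as a breadth-first walk outward from each pivot node, collecting a (start, relationship, end) triplet for every edge crossed. The graph and node names below are hypothetical:

```python
# Toy directed graph: each node maps to (relationship, neighbor) pairs.
GRAPH = {
    "Alice": [("HAS_SKILL", "Python"), ("WORKED_ON", "RoverViz")],
    "RoverViz": [("SIMILAR_TO", "LanderViz")],
    "Python": [],
    "LanderViz": [],
}

def expand(start, hops=2):
    """From one pivot node, walk up to `hops` relationships outward and
    collect (start, relationship, end) context triplets."""
    triplets, frontier = [], [start]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for rel, neighbor in GRAPH.get(node, []):
                triplets.append((node, rel, neighbor))
                next_frontier.append(neighbor)
        frontier = next_frontier
    return triplets
```

In the real system this traversal runs inside Memgraph; the triplets collected for each pivot node become the textual context handed to the LLM.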

GraphRAG provides Ollama with these context triplets alongside the original question, allowing it to generate a context-aware answer to the user's inquiry. Moreover, embeddings are stored directly in Memgraph and indexed using cosine similarity. The system is continually being refined, with plans to test re-ranking and enhance embedding models.

7. Limitations and Future Directions

This project is an ongoing endeavor that continues to evolve daily. Currently, the graph encompasses approximately 27,000 nodes and 230,000 edges, with ambitions to expand significantly. Future enhancements will target:

  • Improving data quality and disambiguation (for instance, mapping "JS" to "JavaScript")
  • Automating the data pipeline
  • Broadening the graph to incorporate employee learning objectives, preferred project types, and skill classifications
  • Refining Cypher generation and RAG accuracy through a model context protocol (MCP)
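The disambiguation goal in the first item can be sketched as a canonicalization pass over extracted skill names. In practice the LLM plus prompt engineering resolves most variants; a lookup table like the hypothetical one below serves as a cheap, deterministic fast path:

```python
# Hypothetical canonical-skill map; real coverage would be far larger and
# maintained alongside the LLM-based extraction.
CANONICAL = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "py": "Python",
    "python": "Python",
}

def normalize_skill(raw: str) -> str:
    """Map a raw extracted skill string to its canonical form, leaving
    unknown skills unchanged."""
    key = raw.strip().lower()
    return CANONICAL.get(key, raw.strip())
```

Normalizing before MERGE-ing skill nodes means "JS" and "JavaScript" land on the same node instead of fragmenting the graph.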

NASA's ultimate goal is to scale the People Graph to accommodate over 500,000 nodes and millions of edges.

Q&A Summary

The community call concluded with a Q&A session, during which participants posed various questions that were paraphrased for conciseness in this summary. For more comprehensive insights, be sure to watch the entire video.

Frequently Asked Questions

1. How did you transition unstructured text into a keyword classification?

Madison: Before implementing LLMs, our team relied on custom Named Entity Recognition with spaCy models, which required labor-intensive, manually tagged training datasets. Thanks to LLMs, we have streamlined this information extraction process.

2. Did you establish graph relationships (schema) in advance, or were they developed iteratively?

Madison: We began with an intuitive grasp of how individuals are connected to their attributes and work. David: Knowledge graphs offer flexibility; a full schema isn't necessary upfront. We identified known relationships and added latent connections as we delved into the data.

3. What led you to choose Memgraph over alternatives like Neo4j?

David: Having used various graph databases over the years, I initially favored Neo4j, particularly its label property graphs. However, Neo4j's costs were prohibitive for our environment. After discovering Memgraph, which operates similarly but at a significantly lower cost, I made the switch. Memgraph's use of Cypher also made the transition smoother.

4. How did you establish relationships based on cosine similarity?

Madison: We compared project nodes by examining the embeddings of their descriptions and formed relationships based on an established cosine similarity threshold.

5. Are you storing data attributes on relationships, such as proficiency levels or years of experience?

Madison: While we have some relationship attributes, we still need to explore quantifying aspects like experience further.

6. How do you manage data duplication and ambiguity, such as variations in skill names like 'JavaScript' and 'JS'?

David: LLMs facilitate comprehension of these variations. We also utilize prompt engineering and context awareness to guide the LLM to attain consistent representations.

Further Reading

If you are new to the world of GraphRAG, we recommend diving into some concise and user-friendly lessons from our subject matter experts, all available for free. Start your journey today!