Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
Introduction
About the Course and Learning Path
Make better software, faster
Milestone: Getting Started
Understanding Operations in Context
Section Introduction
What Is Ops?
-
GCP Defined
-
"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"
-
Logging Management
- Gather Logs, metrics and traces everywhere
- Audit, Platform, User logs
- Export, Discard, Ingest
-
Error Reporting
- So much data: how do you pick out the important indicators?
- A centralized error management interface that shows current & past errors
- Identify your app's top or new errors at a glance, in a dedicated dashboard
- Shows error details such as a time chart, occurrences, affected user count, first- and last-seen dates, and a cleaned exception stack trace
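The error details listed above (occurrences, affected user count, first- and last-seen dates) can be sketched as a small aggregation. This is a toy model in plain Python, not the Error Reporting API; the `ErrorGroup` class, the function names, and the sample events are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class ErrorGroup:
    """Summary of one kind of error, roughly what a dashboard row might show."""
    occurrences: int = 0
    affected_users: set = field(default_factory=set)
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None


def summarize_errors(events):
    """Group (timestamp, error_type, user_id) tuples by error type."""
    groups = {}
    for timestamp, error_type, user_id in events:
        group = groups.setdefault(error_type, ErrorGroup())
        group.occurrences += 1
        group.affected_users.add(user_id)
        if group.first_seen is None or timestamp < group.first_seen:
            group.first_seen = timestamp
        if group.last_seen is None or timestamp > group.last_seen:
            group.last_seen = timestamp
    return groups


def top_errors(groups, limit=3):
    """Return error types ordered by occurrence count, most frequent first."""
    ranked = sorted(groups.items(), key=lambda item: item[1].occurrences, reverse=True)
    return [error_type for error_type, _ in ranked[:limit]]


# Hypothetical sample events, invented for illustration.
events = [
    (datetime(2021, 1, 1, 9, 0), "KeyError", "alice"),
    (datetime(2021, 1, 1, 9, 5), "TimeoutError", "bob"),
    (datetime(2021, 1, 1, 9, 7), "KeyError", "bob"),
    (datetime(2021, 1, 1, 9, 9), "KeyError", "alice"),
]
groups = summarize_errors(events)
```

Calling `top_errors(groups)` on this sample ranks `KeyError` ahead of `TimeoutError`, which is the "top errors at a glance" idea in miniature.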
-
Across-the-board Monitoring
- Dashboards for built-in and customizable visualisations
- Monitoring Features
- Visual Dashboards
- Health Monitoring
- Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
- Service Monitoring
- Set and monitor Service Level Objectives (SLOs), and alert your teams as needed
-
SRE Tracking
- Monitoring is critical for SRE
- Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
- Pinpoint SLIs and develop an SLO on top of them
-
Operational Management
- Debugging
- Inspects the state of your application at any code location in production without stopping or slowing down requests
- Latency Management
- Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
- Performance Management
- Offers continuous profiling of resource consumption in your production applications along with cost management
- Security Management
- With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud
-
What is Ops: Key Takeaways
- Ops Defined: Watch, learn and fix
- Primary services: Monitoring and Logging
- Monitoring dashboards for all metrics, including health and services (SLOs)
- Logs can be exported, discarded, or ingested
- SRE depends on ops
- Error alerting pinpoints problems, quickly
Scratch:
- Metric query and tracing analysis
- Establish performance and reliability indicators
- Trigger alerts and error reporting
- Logging Features
- Error Reporting
- SRE Tracking (SLI/SLO)
- Performance Management
Clarifying the Stackdriver/Operations Connection
- 2012 - Stackdriver Created
- 2014 - Stackdriver Acquired by Google
- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops name
Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs (formerly all called Stackdriver)
"Stackdriver" lives on - in the exam only
Integration + Upgrades
- Complete UI Integrations
- All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
- Dashboard API
- New API added to allow creation and sharing of dashboards across projects
- Log Retention Increased
- Logs can now be retained for up to 10 years and you have control over the time specified
- Metrics Enhancement
- In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
- Advanced Alert Routing
- Alerts can now be routed to independent systems that support Cloud Pub/Sub
Operations and SRE: How Do They Relate?
- Lots of questions in Exam on SRE
What is SRE? - "SRE is what happens when a software engineer is tasked with what used to be called operations" (Founder Google SRE Team)
Pillars of DevOps
-
Accept failure as normal:
- Try to anticipate, but...
- Incidents bound to occur
- Failures help team learn
-
No-fault postmortems & SLOs:
- No two failures the same
- Track incidents (SLIs)
- Map to Objectives (SLOs)
-
Implement gradual change:
- Small updates are better
- Easier to review
- Easier to rollback
-
Reduce costs of failures:
- Limited "canary" rollouts
- Impact fewest users
- Automate where possible
-
Measure everything:
- Critical gauge of success
- CI/CD needs full monitoring
- Synthetic, proactive monitoring
-
Measure toil and reliability:
- Key to SLOs and SLAs
- Reduce toil, up engineering
- Monitor all over time
SLI: "A carefully defined quantitative measure of some aspect of the level of service that is provided"
SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing
Example SLIs:
- Request Latency: How long it takes to return a response to a request
- Failure Rate: The fraction of all requests received that fail (unsuccessful requests / all requests)
- Batch Throughput: Proportion of time that the data processing rate is greater than a threshold
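The three example SLIs above can be worked through on sample numbers. This is a minimal sketch in plain Python, assuming latencies and processing rates have already been collected as lists; the function names and sample values are hypothetical, and using the mean for request latency is an assumption (a percentile is another common choice).

```python
def mean_latency_ms(latencies_ms):
    """Request latency: average time taken to return a response."""
    return sum(latencies_ms) / len(latencies_ms)


def failure_rate(unsuccessful_requests, all_requests):
    """Failure rate: unsuccessful requests / all requests."""
    return unsuccessful_requests / all_requests


def batch_throughput_sli(processing_rates, threshold):
    """Batch throughput: share of sampled intervals whose rate exceeds the threshold."""
    above = sum(1 for rate in processing_rates if rate > threshold)
    return above / len(processing_rates)


# Hypothetical measurements, invented for illustration.
latency = mean_latency_ms([120, 80, 100, 300])        # milliseconds
failures = failure_rate(5, 1000)                      # 5 failed out of 1000
throughput = batch_throughput_sli([900, 1200, 1500, 800], threshold=1000)
```

Each function returns a single number for one measurement window; tracking that number over successive windows gives the "metric over time" that an SLI is meant to be.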
Commit to Memory - Google's 4x Golden Signals!
- Latency
- The time it takes for your service to fulfill a request
- Errors
- The rate at which your service fails
- Traffic
- How much demand is directed at your service
- Saturation
- A measure of how close to fully utilized the service's resources are
Mnemonic: LETS (Latency, Errors, Traffic, Saturation)
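To make the four golden signals concrete, here is a rough sketch that derives all four from one window of request records. The `GoldenSignals` class, the record layout `(latency_ms, succeeded)`, and the capacity figures are assumptions made for illustration, not part of any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class GoldenSignals:
    latency_ms: float   # average time to fulfill a request
    error_rate: float   # fraction of requests that failed
    traffic_rps: float  # demand, in requests per second
    saturation: float   # fraction of available capacity in use


def golden_signals(requests, window_seconds, used_capacity, total_capacity):
    """Compute the four signals from (latency_ms, succeeded) tuples."""
    count = len(requests)
    return GoldenSignals(
        latency_ms=sum(latency for latency, _ in requests) / count,
        error_rate=sum(1 for _, ok in requests if not ok) / count,
        traffic_rps=count / window_seconds,
        saturation=used_capacity / total_capacity,
    )


# Hypothetical two-second window with four requests, one of which failed,
# on a service using 6 of 8 capacity units.
signals = golden_signals(
    [(100, True), (200, True), (300, False), (200, True)],
    window_seconds=2,
    used_capacity=6,
    total_capacity=8,
)
```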
SLO: "Service level objectives (SLOs) specify a target level for the reliability of your service" - The Site Reliability Workbook
SLOs are tied to your SLIs
- Measured by SLI
- Can be a single target value or range of values
- SLI <= SLO
- or
- (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9%, 99.99% (4x 9's)
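The two SLO forms above, and what the common percentages imply, can be sketched in a few lines. The function names are hypothetical, and the downtime arithmetic assumes the percentages are availability targets measured over a 365-day year, which the notes do not state explicitly.

```python
def meets_target(sli, slo):
    """Single-target form: SLI <= SLO (for example, latency under a ceiling)."""
    return sli <= slo


def within_range(sli, lower_bound, upper_bound):
    """Range form: lower bound <= SLI <= upper bound."""
    return lower_bound <= sli <= upper_bound


def allowed_downtime_minutes(availability_percent, period_days=365):
    """Minutes of unavailability that an availability target leaves room for."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


# Hypothetical checks: a 250 ms latency SLI against a 300 ms target,
# and a 45% utilization SLI against a 20-80% range.
latency_ok = meets_target(250, 300)
utilization_ok = within_range(45, 20, 80)

# Downtime left over by each common SLO, in minutes per 365-day year.
budgets = {slo: allowed_downtime_minutes(slo) for slo in (99.5, 99.9, 99.99)}
```

Under those assumptions, 99.5% leaves about 2628 minutes a year, 99.9% about 525.6 minutes, and 99.99% about 52.56 minutes, so each extra nine cuts the room for failure by a factor of ten.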
SLI - Metrics over time that detail the health of a service
- example:
Homepage request latency < 300 ms over the last 5 minutes at the 95th percentile
SLO - Agreed-upon bounds on how often SLIs must be met
- example:
The 95th-percentile homepage SLI will succeed 99.9% of the time over the next year
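The SLI and SLO examples above fit together as a two-step check: first decide whether each 5-minute window meets the SLI, then measure how often that happens. A minimal sketch, assuming latencies are already grouped into windows and using the nearest-rank percentile method (one of several definitions); all names and sample data are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def window_meets_sli(latencies_ms, threshold_ms=300, pct=95):
    """SLI: the pct-th percentile latency in one window is under the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms


def slo_compliance(windows, threshold_ms=300, pct=95):
    """Fraction of windows in which the SLI was met."""
    good = sum(1 for window in windows if window_meets_sli(window, threshold_ms, pct))
    return good / len(windows)


# Hypothetical windows of 20 requests each: one slow outlier still passes
# at the 95th percentile, two slow requests do not.
good_window = [100] * 19 + [1000]
bad_window = [100] * 18 + [1000, 1000]

compliance = slo_compliance([good_window, good_window, good_window, bad_window])
slo_met = compliance >= 0.999  # the 99.9% objective from the example
```

With one failing window out of four, compliance is far below 99.9%, so the SLO is missed in this sample even though most individual requests were fast.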
Phases of Service Lifetime
SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
- Measure and track SLIs (Measuring increasing performance)
- Evaluate reliability
- Define SLOs
- Build capacity models
- Establish incident response, shared with dev team
General Availability Phase
- After Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs
Ops & SRE: Key Takeaways
- SRE: Operations from a software engineer
- Many shared pillars between DevOps/SRE
- SLIs are quantitative metrics over time
- Remember the 4x Google Golden Signals (LETS)
- SLOs are a target objective for reliability
- SLIs are lower than the SLO - or - in between an upper and lower bound
- SREs are most active in limited availability and general availability phases
Operation Services at a Glance
Section Review
Milestone: The Weight of the World (Teamwork, Not Superheroes)
Monitoring Your Operations
Section Introduction
Cloud Monitoring Concepts
Monitoring Workspaces Concepts
Monitoring Workspaces
Perspective: Workspaces in Context
What Are Metrics?
Exploring Workspace and Metrics
Monitoring Agent Concepts
Installing the Monitoring Agent
Collecting Monitoring Agent Metrics
Integration with Monitoring API
Create Dashboards with Command Line
GKE Metrics
Perspective: What's Up, Doc?
Uptime Checks
Establishing Human-Actionable and Automated Alerts
Section Review
Milestone: Spies Everywhere! (Check Those Vitals!)
Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring
Logging Activities
Section Introduction
Cloud Logging Fundamentals
Log Types and Mechanics
Cloud Logging Tour
Logging Agent Concepts
Install Logging Agent and Collect Agent Logs
Logging Filters
Hands-On with Advanced Filters
VPC Flow Logs
Firewall Logs
VPC Flow Logs and Firewall Logs Demo
Routing and Exporting Logs
Export Logs to BigQuery
Logs-Based Metrics
Section Review
Milestone: Let the Record Show
Hands-On Lab: Install and Configure Logging Agent on Google Cloud
SRE and Alerting Policies
SLOs and Alerting Strategy
Service Monitoring
Milestone: Come Together, Right Now, SRE
Optimize Performance with Trace/Profiler
Section Introduction
What the Services Do and Why They Matter
Tracking Latency with Cloud Trace
Accessing the Cloud Trace APIs
Setting Up Your App with Cloud Profiler
Analyzing Cloud Profiler Data
Section Review
Milestone: It All Adds Up!
Hands-On Lab: Discovering Latency with Google Cloud Trace
Identifying Application Errors with Debug/Error Reporting
Section Introduction
Troubleshooting with Cloud Debugger
Establishing Error Reporting for Your App
Managing Errors and Handling Notifications
Section Review
Milestone: Come Together - Reprise (Debug Is De Solution)
Hands-On Lab: Correcting Code with Cloud Debugger
Course Conclusion
Milestone: Are We There, Yet?
Practice Exam / Quiz: Google Certified Professional Cloud DevOps Engineer Exam Prep